So, in the previous talk, we heard about how you can add lots of commands to fix the shortcomings of a somewhat dated model of version control. In this talk, I want to convince you that we can actually get most of that stuff for free if we start by doing some mathematics. My entire life so far has been about making really long stretches between deep theory and deep practice, or between one domain and another. I worked on DNA nanotechnology as a theoretical computer scientist for a while; today I'm working on game theory and economics algorithms for electricity markets. And in general I'm interested in distributed computing: Pijul is a byproduct of that interest. So, in this talk I'll first start by defining things. There will be some mathematics involved and also some practical stuff, so if you're not interested in the mathematical parts, bear with me, they won't be too long, hopefully. All right. I'll first define what version control is for me, then talk about our solution and what we've been working on, and then some cool implementation details, or at least things that I find cool. And then, well, my talk title was a little bit provocative, so I'll try to come back to that and comment on that choice by asking the question again: is this really post-Git? Okay. So, version control. Probably many people here use it and know exactly what it is, but let's define it anyway: it's one or more co-authors editing a tree of documents concurrently. And one thing that's super important is the asynchronous nature of the thing: the co-authors can choose when they want to sync or merge. Reminding ourselves of these definitions sounds a little trivial, but it actually matters a lot when you're deep down in the mathematics and you're trying to understand what you're doing and why you're doing it. For example, how does version control differ from Google Docs? The answer is conflicts. That's a feature that's often been overlooked in current systems like Git, Mercurial, SVN, CVS, and RCS before them. And finally, version control also allows you to review a project's history, which is a very important feature as well. Many people consider version control a solved problem; well, apparently not everybody, but many people do: Git is here to stay, it solved version control 15 years ago, and that's the end of the story. Unfortunately, I personally don't think this is true. Some symptoms indicating it might not be true: our tools, no matter how good we think they are, still aren't used by non-coders, despite their maturity. We've been working on this for 30 years, and the thing we're most proud of in this industry cannot even be used by outsiders. Nowadays we have, like, silverware designed by NASA, and everybody at NASA is proud of that, and everybody buying silverware designed by NASA is proud of that. But our version control tools aren't used by outsiders, or only marginally. Our tools are distributed, yes, but most of the time we use them with a global central server.
I've seen a poster at this conference saying "not all paths lead to Chrome", but apparently many paths lead to GitHub. Our tools also require strong work discipline and planning: you need to plan your branches ahead of time, follow really strict workflows, decide between rebase and merge. We even had a slide on that in the previous talk: are we a merge shop or a rebase shop? In this talk, I'll try to convince you that this doesn't matter at all. And all of this results in a significant waste of human work time at a global scale. Improvements have been proposed, for example Darcs, but they don't really scale and almost nobody uses them. So, can we get a quick fix? Can we get that next Git command that will fix everything? Unfortunately, I don't think so. First of all, because abstractions leak terribly in Git, and the reason isn't the UI, isn't the bad naming of some commands or arguments. The reason is that if we consider Merkle trees and DAGs as the core mechanism, they can't really be hidden from the user. If that's the core thing you want, and if all the properties you want come from that, there's absolutely no reason to hide it from the user, and there's no hope you could even succeed in doing that. Similarly, if strict ordering of snapshots is the main feature, then the most used Git commands, like rebase, rerere (we heard about that in the previous talk), cherry-pick, are fixes around that very feature. So why do you need to fix your main feature? That's strange. Anyway, some more symptoms. There is an inflation of commands and options in Git. I'll show you an example here: this has gotten so comical that someone made a Git man page generator that actually looks credible. If you reload the page, you get a different one every time: "git environment grabs non-reset downstream environments using past local garbage collectors while overriding fitting shells to survey the given environments." Please don't use this command. Anyway, there is also an inflation of UIs; even Big Tech is now investing in Git and Mercurial UIs, I won't cite names here. And there's an inflation of forges. How many forges were started last year alone? I know a bunch of forges, because they sometimes contact me to help them do something with Pijul. My claim here is that we all consider text editing a solved problem: there's no new text editor popping up every now and then and convincing VCs that this is a really cool idea. Window managers, same thing. But forges, you keep seeing them popping up. I'm not saying this is bad; it's actually really good, it's fantastic that the ecosystem is thriving and there's a lot of diversity. But my claim is that maybe this inflation comes from some more fundamental thing that we don't understand. Now, on to our demands. First, we demand associative merges. It may not sound like it matters much, but I'll show you in the next slide that you actually want associative merges. Associativity means that when you have two changes, A and B, and you merge them together, it should do the same thing as merging A followed by merging B. The reason you want that is that you want to be able to review your patches or your commits one by one, then merge them, and trust that the merge does exactly what you think it does; except in Git, it doesn't. And if A and B can be produced independently, the order should not matter. That's what I mean about being a rebase shop or a merge shop: it shouldn't matter.
We don't want to ask this question; we want to get back to work. Branches: we do want branches, everybody loves branches, but not too many branches. Branches are good until they aren't; I'll tell you more on that later. We also want, and that's something I personally really want, low algorithmic complexity, and ideally also fast implementations, though in my way of seeing things those come second. I'll also give you an example of a very fast implementation later on. All right, associative merges, the first of our demands. This is exactly what I described, but here is a graphical view of it. You have two co-authors, Alice and Bob. Alice produces one commit, A, and Bob produces two commits, B and C. In the first scenario, Alice merges Bob's first commit and then Bob's second commit; in the second scenario, she merges both commits at once. Nobody would expect the commit identifiers to be the same, they will necessarily be different, but if there's no conflict at all, the contents should absolutely be the same. And this is actually false. So here is a counterexample where Git is not associative, and this might be a problem; it's actually terrifying to me. We start on the left with a simple file of two lines, A and B. Alice follows the top path: she adds a first commit with a G at the very beginning of the file, and then she goes on and adds another commit with another copy of A and B before that G. Concurrently, Bob inserts an X between the initial A and B. If you have a laptop here, you can try to simulate that in Git today; and this is actually not a bug, it's a feature of Git. What happens if you do that is that Bob's new line gets merged into Alice's new lines. If I were working on a high-security project, this would absolutely terrify me. It means that Git can randomly shuffle your lines around, and do it silently, without telling you: there's no conflict, nothing, it just works. And yet it doesn't really work; it doesn't do what you think it does. So we don't want that; we want to fix that problem. We also want commutative merges. That's a more controversial one. We want the property that if Alice and Bob work together, Alice can pull Bob's changes and Bob can pull Alice's changes without having to worry too much about the resulting hashes or the contents of the files: if the patches are independent, you should be able to apply them in any order without changing the result. Git and SVN are never commutative. So why would we want this? There are actually very good reasons. For example, you might want to unapply an old change, a patch you made a few patches ago that was wrong, and undo it without having to change all your identities. Of course, we also want state identifiers, and I'll come back to that later. We want cherry-picking: we want to be able to take just that one patch from a different branch, maybe a bug fix, and pull it into our branch without having to rebase everything and change every commit's identity, and yet keep strong, unforgeable state identifiers. And we want partial clones. Partial clones means pulling just the patches related to a subproject, and possibly also, in the other direction, merging repos transparently.
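As an editorial aside (a summary of these demands, not a slide from the talk), the two merge properties can be written compactly, using $\sqcup$ for merging and juxtaposition for applying patches to a state:

```latex
% Associativity: merging B and C one at a time gives the same result
% as merging them together.
(A \sqcup B) \sqcup C = A \sqcup (B \sqcup C)

% Commutativity: if patches p and q could have been produced independently
% (neither depends on the other), their order of application must not matter.
X \cdot p \cdot q = X \cdot q \cdot p
```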
Scott was talking earlier about monorepos and how Microsoft devoted probably millions of man-hours to making their monorepo work. But actually, if you first try to model things properly and understand commutativity, you don't need all that: you can get monorepos for free. And that brings us to one of the crucial slides in this presentation, about states versus changes, which is the way we see things in order to think about version control. I understand I'm not giving you many new commands or cool command-line hacks, but before getting to that we had to think about what it means to do version control. So, states versus changes. There are two fundamentally different ways to model what version control is. One is to see it as a series of versions, a series of snapshots of your repo, which is what Git, Mercurial, SVN and CVS do, and only compute changes as a byproduct of those. The question we can ask is: what if we did the opposite? What if, instead of considering that working on a project means creating a new version, we considered that working on a project means changing it, creating a change? And another question we can ask is: what if we did both at the same time? That's what we do. All right, a little bit of bibliography first, and this is getting a little mathematical, so bear with me if you don't understand everything; I'll get back to cool implementation stuff later. A change-based idea from the 80s is called operational transforms. It is what Google Docs, for example, uses, and Darcs uses something similar. In operational transforms, we start here with a file of only three lines. T1 on the left, so this is Alice, let's say, inserts an X at the beginning of the file, and T2 deletes the character C, so we get into two divergent states. What operational transforms do is say: now that we've inserted an X at the beginning of the file, we have to rewrite the other, concurrent changes. For example, instead of saying "delete the second character, it was a C", you now have to say "delete the third character, it was a C". And that's how you can merge things. Darcs does this, and uses it to detect conflicts. One issue is that if you only have insertions and deletions it's okay, though there are still performance problems; but as soon as you start handling more than insertions and deletions, you get a quadratic explosion of cases, because you have to handle all pairs of types of operations. According to Google engineers who worked on the Google Docs project, this is an absolute nightmare to implement, and I've never heard anyone who implemented operational transforms say, yeah, they're cool, they're really easy. A hybrid, more recent approach is CRDTs. How many people here know about CRDTs? Okay, a reasonable number. The general principle is very simple. The idea is to design a structure where all operations have the properties we want. Instead of having your structure and asking, okay, now how do we merge changes, you take the problem in the opposite direction and ask: how can we design a data structure from scratch so that all operations on it have the right properties, meaning they are commutative, associative, they have a neutral element, and all those algebraic things? A natural example of a CRDT, a very simple one, is increment-only counters: counters where you can only add one to the counter.
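To make that counter example concrete before going on, here is a minimal editorial sketch of a grow-only counter CRDT in Rust (illustrative only; the names and layout are mine, not taken from any particular library): each replica counts its own increments, the value is the sum, and merging is a pointwise maximum, which is commutative, associative and idempotent.

```rust
use std::collections::HashMap;

/// A grow-only counter CRDT: each replica only increments its own slot.
#[derive(Clone, Default)]
struct GCounter {
    counts: HashMap<String, u64>, // replica id -> number of local increments
}

impl GCounter {
    fn increment(&mut self, replica: &str) {
        *self.counts.entry(replica.to_string()).or_insert(0) += 1;
    }

    /// Total value seen so far: the sum over all replicas.
    fn value(&self) -> u64 {
        self.counts.values().sum()
    }

    /// Merging is a pointwise max, so it is commutative, associative and
    /// idempotent: exactly the algebraic properties we want from a CRDT.
    fn merge(&mut self, other: &GCounter) {
        for (replica, &n) in &other.counts {
            let e = self.counts.entry(replica.clone()).or_insert(0);
            *e = (*e).max(n);
        }
    }
}

fn main() {
    let mut alice = GCounter::default();
    let mut bob = GCounter::default();
    alice.increment("alice");
    bob.increment("bob");
    // Merging in either order gives the same value: 2.
    alice.merge(&bob);
    assert_eq!(alice.value(), 2);
}
```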
Counters are a very easy and natural example of a CRDT, because if Alice and Bob both increment the counter, you just have to add two to the result to merge their changes. Insert-only sets, for the exact same reason, are natural CRDTs. And if you want to do things like deletions, or more subtle data structures and more interesting operations, you have to use trickier techniques like tombstones and Lamport clocks. I won't get into those details; if you're interested, we can talk later. A useless example of a CRDT that's often invoked is a full Git repository: it's just an append-only set of commits, so yeah, sure, it's a CRDT, it's commutative and all that. But it's not super useful to see a Git repository as a CRDT. It basically just means you can clone a Git repository and keep pulling into it; it doesn't mean it handles merges properly. What we want is for the heads of the Git repository to be a CRDT. All right, so how do we do that? Merges where everything goes right are not really interesting, so we'll start by modeling conflicts, because they're the hardest case, and if we cannot model conflicts properly, there's little hope of doing anything interesting. Conflicts are where we need a good tool the most. The exact definition depends on the tool; Darcs, for example, has really cool and exotic definitions of conflicts. For example, if Alice and Bob are writing to the same file at the same place, I'd say most tools consider that a conflict, but not all. When Alice renames a file from F to G while Bob renames it to H, I'd say most tools also consider that a conflict; some will just pick a name randomly. Alice renames a function f while Bob adds a call to f: that's a trickier one, and very few tools that I know of can handle that conflict properly. Darcs can do it; my tool cannot, and Git can certainly not do it. So how do we solve all these problems at once and get all the nice properties for free? We do it using category theory, a mathematical framework that gives you really nice tools to model and work with abstractions of things in general. Our modeling of this problem in category theory is: for any two patches f and g, what we want is a unique state P, which is sort of the minimal merge of f and g, such that for anything Alice and Bob could do after f and g to reach a common state Q, that common state can also be reached from P. So instead of doing some work to get to a common point, you can always first get to the common point and then do the work. If P exists, category theorists call it the pushout of f and g. The reason we're interested in that is that category theory has a lot of tools to start from this simple modeling and give us lots of cool stuff, like free data structures, basically doing our job for us. In this case, category theorists would notice that the pushout of two patches doesn't always exist, and this is exactly equivalent to saying that sometimes there are conflicts. And now the question becomes: how do we generalize the representation of states, the X, Y and Z here, so that all pairs of changes f and g have a pushout? The solution is to generalize states to directed graphs, instead of just sequences of bytes, where the vertices of these graphs are bytes. I'll give you an example in the next slide.
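Stated formally, as an editorial restatement of the property just described: for patches $f\colon X \to Y$ and $g\colon X \to Z$, a pushout is a state $P$ with patches $p_Y\colon Y \to P$ and $p_Z\colon Z \to P$ such that:

```latex
% The two routes into P agree, and any common state Q that Alice and Bob
% can reach is reachable from P in exactly one way.
p_Y \circ f = p_Z \circ g,
\qquad \text{and for every } Q \text{ with } q_Y \circ f = q_Z \circ g :\quad
\exists!\, u\colon P \to Q \ \text{ such that } \ u \circ p_Y = q_Y \ \text{ and } \ u \circ p_Z = q_Z .
```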
Vertices are bytes, and edges represent the union of all known ordering relations between bytes. That sounds a little far-fetched, but it's actually very clear on an example. The way we model these things in Pijul is as follows. The first example of a simple patch: how do we add some bytes to our data structure? In our graph, every vertex is labeled by a change number, here for example C0 is change number zero, and an interval, such as [0, n), representing bytes inside that change, inside that patch. The edges of our graph are labeled by the change that introduced them. So, starting from an initial file C0: [0, n), how do we add some bytes in the middle? We first split the initial vertex, then add a new vertex in the middle, and then reconnect everything so that we get the ordering right: the bytes introduced by C0 between 0 and i come before the bytes of C1 between 0 and m, and these in turn come before the bytes introduced by C0 between i and n. This is how we do insertions. The rest is a matter of implementation: how do we store these giant graphs efficiently on disk, and so on. Deleting works in more or less the same way, except we now introduce a new thing, the edge label. Deleting a vertex in our system means turning an edge from a continuous line into a dashed line. We do more or less the same thing: in this example we're deleting the bytes from j to i in C0 and from 0 to k in C1, bytes that were introduced by previous patches. And this is what we get in the end: a bunch of vertices, with dashed lines indicating which bytes have been deleted and which are still alive. And the good news is that we don't need more than that to build an entire version control system. This is rebuilding the foundations first, and it's actually really cool, because it's a very minimalistic system, and we like that because it makes everything else easy. So, two kinds of changes: adding a vertex to the graph in a context, meaning the parents and children of that new vertex, and changing an edge's label. That's all we need, and from these two things we get a ton of cool properties for free. First, we get free conflict handling: there's no separate notion of conflict in this, we're just adding vertices and changing edge labels, and the graph naturally models conflicts. Conflicts are possible, they're properly modeled inside the graph, and they can be talked about and manipulated without any special treatment. Our definition of a conflict here is as follows. We first call live the vertices whose incoming edges are all alive, meaning they're all full lines; dead the vertices whose incoming edges are all dead; and the vertices in the middle, with both alive and dead incoming edges, we call zombies. We then say that the graph has no conflict if and only if it has no zombie and all its live vertices are totally ordered, meaning we can actually compute a full ordering of all the bytes: we know exactly what order the bytes come in in the file. If we have that, then we can output the file to the user, and it actually makes sense.
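As a purely illustrative sketch (hypothetical Rust types, not Pijul's actual internals), the graph just described, and the live/dead/zombie classification, could look something like this:

```rust
/// Identifier of the change (patch) that introduced a vertex or an edge.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ChangeId(u64);

/// A vertex: a byte interval [start, end) inside the change that introduced it.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
struct Vertex {
    change: ChangeId,
    start: u64,
    end: u64,
}

/// Edge status: a solid line (alive) or a dashed line (deleted).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum EdgeStatus {
    Alive,
    Deleted,
}

/// An edge, labelled by its status and by the change that introduced it.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
struct Edge {
    status: EdgeStatus,
    introduced_by: ChangeId,
}

/// Classify a vertex from its incoming edges: all alive means live,
/// all deleted means dead, a mix of both means zombie (part of a conflict).
fn classify(incoming: &[Edge]) -> &'static str {
    let alive = incoming.iter().any(|e| e.status == EdgeStatus::Alive);
    let dead = incoming.iter().any(|e| e.status == EdgeStatus::Deleted);
    match (alive, dead) {
        (true, false) => "live",
        (false, true) => "dead",
        (true, true) => "zombie",
        (false, false) => "unreachable",
    }
}

fn main() {
    let edges = [
        Edge { status: EdgeStatus::Alive, introduced_by: ChangeId(0) },
        Edge { status: EdgeStatus::Deleted, introduced_by: ChangeId(1) },
    ];
    // One change keeps these bytes alive while another deletes them: a zombie.
    println!("{}", classify(&edges));
}
```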
Some notes on this system. It gives us changes, or diffs, that are not exactly Unix diffs: they are Unix diffs plus tons of metadata to make it all work. And they are now partially ordered by their dependencies on other changes. This means you cannot possibly work inside a file that has not been introduced into the repository yet, and you cannot edit a paragraph that doesn't exist yet. So not all changes are commutative, but changes that could have been made independently are commutative. Cherry-picking is now the same thing as applying a patch. We only have two operations in the system, apply a patch and unapply a patch, and that does everything, so cherry-picking is just applying a patch. There's no need for git rerere, because conflict resolutions are themselves changes, and you can cherry-pick changes. You don't need any special hack or Rube Goldberg machine to remember a conflict resolution: the resolution is just a patch, and you can send it to others, push it, pull it, and that's it. Partial clones, monorepos, submodules: they are easy, as long as overly wide patches are avoided. If you have a patch that just does some formatting across your giant monorepo, then you will have a problem, because this patch will probably have tons of dependencies and you will end up pulling lots of them into your partial clone. But if you're careful enough to cut this patch into smaller pieces, then everything becomes easy: you can pull just the patches relevant to a tiny part of your giant monorepo, and it just works because of commutativity. All the patches you produce locally by working necessarily commute with the patches produced by your co-authors in other parts of the monorepo. So after your day of work, when you push your patches to the server, others do the same, and it doesn't matter in which order it happens, because it gives the same result in the end. For large files: the graph I showed you in the previous slides didn't talk at all about contents. The contents are obviously super important, but they are only handled during diff; the patches themselves don't use the contents of the vertices in order to be applied. You can apply a patch that just says: I added this file, it has one terabyte of data, and you can find that data in some change somewhere. A nice consequence is that for large files you can apply a patch without knowing what's in it, and fetch the rest later. So if you're running a video game shop, for example, and you have artists pushing large binary assets all day, then at the end of the day, when you want to pull everything, you don't have to pull all the intermediate versions. You pull just the operational part of those versions: I added one gigabyte here, then I replaced that gigabyte with another one, and then yet another one. Maybe you had ten versions of your binary asset during the day, but only one is still alive, so you just pull that one. There's no special hack, no LFS needed; you're just using patches. All right, now on to some implementation things. There's a lot to say about the implementation. The project is entirely written in Rust, or mostly written in Rust, I'd say.
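Here is a small editorial sketch, with made-up types rather than Pijul's real ones, of the dependency rule behind the monorepo story above: a change records the changes it depends on, and two changes commute exactly when neither depends on the other.

```rust
use std::collections::HashSet;

/// Stand-in for a real change hash.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Hash64(u64);

/// A change carries the hashes of the changes it depends on, i.e. the
/// changes that introduced the bytes (or context) it touches.
struct Change {
    hash: Hash64,
    dependencies: HashSet<Hash64>,
}

/// Two changes commute exactly when neither depends on the other:
/// they could have been produced independently.
fn commute(a: &Change, b: &Change) -> bool {
    !a.dependencies.contains(&b.hash) && !b.dependencies.contains(&a.hash)
}

fn main() {
    let base = Hash64(0);
    let p = Change { hash: Hash64(1), dependencies: HashSet::from([base]) };
    let q = Change { hash: Hash64(2), dependencies: HashSet::from([base]) };
    let r = Change { hash: Hash64(3), dependencies: HashSet::from([p.hash]) };
    assert!(commute(&p, &q));  // edits in unrelated parts of a monorepo commute
    assert!(!commute(&p, &r)); // r touches bytes introduced by p, so order matters
    println!("ok");
}
```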
I won't cover all the details of how our implementation works, because that would take an entire day of talks, and I won't do that. I'll just show you some cool things that I like about it. The first challenge is that we have really large graphs on disk, and we obviously don't want to load them entirely into memory in order to edit them; we want to work directly with the on-disk data structure. So we want to store edges in a key-value store, because that's an easy way to do that kind of thing. We want transactions, because we're inventing a new data format, so we absolutely want ACID properties: full transactionality, passive crash safety, all those things. And we want branches: we want to be able to take a key-value store and just fork it without copying a single byte, because copying would take time linear in the size of the history, and we don't want that. There was no key-value store when we started that would do all that, especially the branching feature, so I had to write one. It's called Sanakirja, which means "dictionary" in Finnish. It's an on-disk transactional key-value store, but it's actually not just a key-value store, it's something much more generic: an ACID block allocator in a file, really. That block allocator uses B-trees to allocate memory, but the B-trees themselves also use the allocator to do their job. That's why the minimal data structure we can have is B-trees, but you can also write all sorts of other data structures with Sanakirja just by using the allocator. We get crash safety using referential transparency and copy-on-write. The initial goal was completed successfully: the tables are forkable in O(log n). That's probably completely useless for a general-purpose key-value store, but in our case branches were really needed, so we had to do it. So, forkable in O(log n), logarithmic time in the total number of keys. It's written in Rust; it started back in 2015 or so, when Rust was still a bit younger than it is today. A cool bit is that it lets you access generic Rust types directly as pointers into the file. It is way too generic, way too many things, meaning the API is horrible to use, but a good consequence of that generality is that it's even generic in the underlying storage layer. For example, we can use it on a memory-mapped file, which is what we do all the time in Pijul, but we can also use it on zstd-compressed files. I've also used it, not super successfully, on Cloudflare KV, so building a key-value store on top of another key-value store; the prototype worked. The cool thing is that you can also use it to build ropes, for example, or Patricia trees, or vector search indexes. So I thought implementing this on Cloudflare KV was interesting, but it's actually too slow to be really useful. All right. A very unexpected consequence of all this is that Sanakirja is the fastest key-value store we've tested: it actually beats LMDB, the fastest C key-value store, by a pretty wide margin. In these graphs I've also included what is actually the coolest project in this space, which is not Sanakirja, it's Sled.
Sled is a really fantastic key-value store that allows multiple concurrent writers, using really cool, modern database technology. Sanakirja is much more modest in scope, but it's also way faster. If we remove Sled, which is more on the experimental side, we can see that we beat LMDB by 50% or so. I've also included the Rust standard library B-trees in these graphs; they're the theoretical limit. You cannot go faster than that, because they don't even store anything on disk, and we have to store stuff on disk, so we will be slower. Okay. So using Sanakirja, I can build modular databases. Like I said, it's a transactional block allocator with reference counting included, and I've built different data structures on it. One cool thing you cannot do with other key-value stores, but that can be done in Sanakirja, is composite types. For example, in Pijul, branches are B-trees from strings to other B-trees that store our graphs. You can nest data structures like that, which is cool. I have a prototype text editor with forkable files: you can click a button and get a free copy of your file, sharing all its common bytes with the previous one, and its type is something like a B-tree from strings to ropes over the Pijul graph. So if you're interested in data structures and performance challenges, join us, because we're doing cool stuff in this space. All right, back to my claim that this is post-Git. Things we get for free: I've mentioned a few of these already. One thing I haven't mentioned is super fast pijul credit, which is the Pijul equivalent of git blame, because we don't want to blame our co-authors, we'd rather credit them. The information is readily available in the graph, so you get that for free. Scott was talking in his talk about the really cool hacks you can use to speed up Git's blame; but if you don't use Git at all, if you model your data properly, you don't even need to speed things up, because the information is readily available in the graph, so it's fast by default. You can have your bug fixes in your main branch: you don't have to plan feature branches and bug-fix branches in advance, you can push fixes to production from your main branch, and you can work on several features in the same branch and decide what belongs to which feature after the fact. So, no more rigid workflows, and hopefully way fewer meetings. You can get submodules. Submodules don't have to suck; don't let anyone tell you otherwise. You can get them for free using patch commutativity: changes in unrelated projects can be produced independently, so they commute, and you get submodules for free. Signing and identity: this is something Git introduced recently as well, but after we did. Your identity is your public key, all patches are signed by default, and identity changes are easy and possible, because we like to welcome everyone; people sometimes change their identity, and we don't want their personal life to interfere with their work. You get free cherry-picking, I've said that a couple of times already: you just apply the patch, no need to change its hash, its identity. And you get almost free scalability to very large monorepos; I've said that too, no Rube Goldberg machine needed. Just one more cool bit of implementation: we have commutative state identifiers.
I said earlier that we want to hash our patches to make them unforgeable. But we also don't want the system to be just a soup of patches, with hidden states where you don't understand anything anymore. So we do want state identifiers, like commit hashes. The thing that's hard for us is that, because patches are commutative, we want A then B to give the same state identifier as B then A. At the same time, we want these state identifiers to be fast to compute: a naive version would be to sort all your patch hashes and then hash the result, but that takes time linear in the size of the history. There is a cool trick, though, which is to use discrete logarithms and elliptic curves: you turn each patch identity H into an integer, and you identify a state by the product of all its patches' hashes. This is something you can compute very easily from a state and the next patch, so it takes constant time per patch, and it's commutative for free. That's a trick I like.
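To illustrate that trick, here is a toy editorial sketch: it maps each patch hash into a multiplicative group modulo a prime and identifies a state by the product of the elements of all patches applied so far. It is commutative and constant-time per patch by construction; the real scheme would use a group in which discrete logarithms are hard (such as an elliptic curve) so that identifiers are unforgeable, which this toy modulus is not.

```rust
// Toy commutative state identifiers (editorial sketch, not Pijul's code).
const P: u128 = (1 << 61) - 1; // a Mersenne prime, fine for a toy example

/// Map a patch hash to a nonzero element of the multiplicative group mod P.
fn hash_to_element(patch_hash: &[u8]) -> u128 {
    let mut x: u128 = 0;
    for &b in patch_hash {
        x = (x * 256 + b as u128) % P;
    }
    1 + (x % (P - 1))
}

/// Applying a patch updates the state identifier in constant time;
/// multiplication mod P is commutative, so A then B equals B then A.
fn apply(state_id: u128, patch_hash: &[u8]) -> u128 {
    (state_id * hash_to_element(patch_hash)) % P
}

fn main() {
    let empty: u128 = 1; // identity element: the empty state
    let a = b"patch A";
    let b = b"patch B";
    assert_eq!(apply(apply(empty, a), b), apply(apply(empty, b), a));
    println!("{:x}", apply(apply(empty, a), b));
}
```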
All right. Now, what are the future developments in this space? We're working towards a hybrid state/patch system. In Git, as I said, commits are states, not changes; there are blog posts reminding us of that fact all the time, even though patches can be computed after the fact. Darcs only has changes, and recomputes states as needed, so it's the completely opposite approach. Pijul has both, as I've tried to convince you: it has a data structure modeling the current state, but that data structure was not really designed, it was found, calculated from the nature of the patches we wanted to have on text files. It is therefore completely transparent. This is not a leaky abstraction; it's something derived from the patches themselves, not just a cache that leaks or sometimes becomes inconsistent with the patches that were applied to it. All right, ongoing projects in that direction. We have tags currently, but they are a bit slow and bloated, so we're working towards lightweight tags, a feature that will add super fast history browsing to Pijul while retaining all the good properties of patches. Currently, tags are implemented as a Sanakirja database using compressed files as a backend. Another thing we've been discussing on our Zulip is patch groups, or topics, or something like that, to group patches together, to be able to say: I have just one branch, I don't want to work with branches. Maybe that's bad practice, maybe it's good practice; I don't want to tell people what their good practices should be, it should be whatever makes them fast and efficient, that's all. Patch groups would allow people to keep all their patches in the main branch and then push just the patches related to one specific feature, so they can be a bit more organized, for example. Someone proposed queues recently, to avoid half-merged states: when you're merging patches one by one, you can sometimes get into a state that was never tested by anyone, and that's not great. So we want to add queues to get the best of both worlds, to be able to say: I've done a merge here, I haven't just applied this patch, I've applied it together with these other patches. That's the kind of stuff we're working on, and it's not super hard. If you're interested in contributing, we're really welcoming; well, we hope we're welcoming enough. So, if you want to help us: this is currently a large project with a small team, and there are lots of satellite projects. We've built our own database system in order to build this; we've built our own SSH implementation in order to build this. But fortunately, proper mathematics makes that workable: we have far fewer things to implement than any other version control system. It's bootstrapped, meaning we've been using it on itself since 2017, which wasn't without problems in the beginning, because some patches could only be applied if you already had the patch, so there were interesting challenges. There's a lot of effort needed on documentation, accessibility, tutorials, UI bikeshedding; we really need help with all of that. We have "good first bug" tags on our repository. Our repository obviously isn't hosted on GitHub, because they don't support Pijul yet, so we had to build our own forge, unfortunately. So if you want to come help us, the good first bugs are the way to get started, and come say hi on our Zulip; the URL is here. All right. So: open-source version control based on algorithms and proper mathematics, plus cool low-level stuff to optimize databases, scalable to monorepos and large files for free, and potentially usable by non-coders. I've shown it to absolute non-coders; I've discussed it with some people in parliament, the French parliament actually. Artists can use it without having to learn what a rebase is, without having to know whether they're in a merge shop or a rebase shop. Lawyers can use it to version their documents, and why not others: Sonic Pi composers and musicians, Lego builders, and whatnot. We have a repository hosting service available. That's it, thank you. Okay, if you're leaving, please leave quietly; we have time for a couple of questions. Raise your hand. Okay, quiet please. Hi, thank you for your talk. I had a quick question regarding diff algorithms. I think the default diff algorithm in Git is Myers, and there's a bunch of other ones you can select. Sorry, I can't hear the question, can you speak up please? Sorry, I'll speak a bit louder. My question was regarding diff algorithms. I think in Git the default algorithm is Myers, and there are a few others you can select, I believe. But something that's a bit too bad is that the diff algorithm is generic: it's only diffing text, it doesn't know what it is actually diffing. Whether it's a JS file or a Python file, it doesn't really care. So I think it could be interesting to have plug-and-play diff algorithms, to have more semantic diffing based on interpreting what you're diffing. Would that be something you're thinking about? Yeah. So the question is: can you swap out the diff algorithm, replace it with something else? Yes, actually you can.
But unlike in Git, where Git doesn't store diffs and you can compute a diff after the fact and change your diff algorithm after the fact, here that won't be possible, because the diff is so core to what we do. So you can definitely change the diff algorithm, but you have to do it while you're working, when you create your patches; you cannot do it after the fact. That's a trade-off we've made. Hi, it looks like a really interesting tool. I was going to ask: how easy do you think it would be to automatically migrate from, say, an existing Git repo into a repo based around this system? Well, you tell me: we have a Git importer. One pain point currently is that there's no exporter, because importing and then exporting would be doing a round trip, and because diffs are ambiguous in Git, a round trip would create artificial conflicts. So we haven't implemented that; there are interesting challenges towards that goal. One thing needed for perfect interoperability would be an importer and an exporter that work in a transparent way. Currently, there's only a Git importer, so if you want to convert your project to Pijul, you can use it, but you cannot just work on the side using Pijul and then collaborate with the rest of your team using Git. We'd love to have that, but there are theoretical challenges towards that goal. How easy is it? I don't know; try it and tell us if it was too hard. Hey, thanks for the talk, I'm over here. You said that Darcs can recognize it as a conflict when Alice renames a function f to g and Bob adds a call to f concurrently. How does that work? That was a little bit of an exaggeration on my part. What Darcs can do is go beyond just insertions and deletions: it has a command called darcs replace, where you can replace an identifier with another one, and that operation is made to commute with the other operations, so Darcs is able to detect a conflict in some cases. But I wouldn't rely on it to check the semantics of my repository. This is just a tool; it isn't meant to solve all conflicts automatically. Conflicts are part of a normal working process. As some say, seeing is believing. You're working on something that could be called, or is at least similar to, groupware. We know that someone like Douglas Engelbart gave a nice demo and showed that people can actually use such a thing. You're saying that potentially non-coders can use this, but do you have some demos, something that shows this in action and used by non-coders, or do you have plans to show something like that? I don't really have a demo of non-coders working with it. The specific non-coders I was thinking about when I wrote that on the slide are people doing contracts in my company: we're using plain-text contracts for our customers to make sure we're on the same page, and we're using version control between sales and implementation to make sure we're not selling stuff that doesn't exist, for example. So that's the kind of thing. But it's still very limited, and we don't yet have full demos of, say, an entire department operating in Pijul, because no department has; they've started to look at it, but no, there's no demo of that. But yeah, we'd very much love to start collaborating with non-coders to make it welcoming and useful and fun for them. So you think you can do that, actually?
Yeah, I'm pretty sure I can. I can explain all of Pijul, the entire UI of our CLI, in just a few minutes to people who have never coded before. So yeah: apply a patch, unapply a patch, that's all you do. I was extremely excited to see this talk. I started off very skeptical and I completely changed my opinion, and I really appreciate that; it was a really effective talk and I'm very glad about that. I wanted to say, regarding operational transforms versus the directed graph model that you have: I noticed that the boxes you drew for operational transforms seemed similar to the boxes you drew to describe your category theory, sorry, your directed graph approach. The question I wanted to ask is: could you view each vertex in the operational transform picture as being the entire state of the document, whereas in your directed graph model each vertex is instead an individual byte range? So would you say that this analogy to operational transforms is kind of correct, in that each vertex is not the entire state, and that part of the reason why Pjul, sorry, I mispronounced that, Pijul and the rest of your software works is that it has less of a dependency on the entire state of the document? I don't know if I explained it right, but I felt like I had a realization, and I wanted to know whether you had thoughts about how this directed graph model is inherently easier or harder than operational transforms. To me it makes sense why it's stronger, because there's less dependency on state, and I wanted to know if you thought that was a reasonable description. Yeah, that's reasonable. One comment I would have is that the main difference with operational transforms is that if you want, for example, to merge n patches in a sequence, in Pijul you don't have to look at pairs of patches. You can just apply one patch and then the next one, and they don't have to see each other: they don't modify each other when they're being applied, which is not the case in operational transforms. So that's the main difference. But I agree that it's confusing that the diagrams look similar. What is the killer feature of Pijul that will make it succeed where Darcs failed? Performance. Okay, speaking of performance, how much of that is attributable to the change of data structures and algorithms, and how much is attributable to writing it in Rust rather than Haskell? Most of it, almost 100% of it, is algorithms. Sure, writing it in Rust makes it faster than writing it in a garbage-collected language, but that's marginal. The main thing about Rust is that it allows you to write really low-level stuff, which lets you build different kinds of algorithms. For example, Sanakirja: I cannot really see how it could be written in Haskell or OCaml; it would be really, really painful. Rust makes it much easier, because this is low-level stuff, and that's where you get most of the optimizations. But yeah, performance: Pijul was doubly exponentially faster than Darcs for merges, back when Darcs had the exponential merge problem a couple of years ago. That has been fixed since then, so now we're only exponentially faster than them there.
For example, if you have a really large file with a really long history in Pijul and you want to apply a patch in the middle of that file, this will take time logarithmic in the size of the file. In Git, you have to look at the entire file, compute a diff between that file and your patch, and then apply it, so it's linear in the size of the file. So we're still exponentially faster than that. Of course it doesn't matter much, because most files are not so crazily large that you can see the difference in algorithmic complexity between Git and Pijul, but on really, really large files it would matter. So what that means is: yeah, the killer feature is that you can scale to monorepos, where Darcs, well, I've seen it fail on a paper. I've seen merges take a really, really long time, several minutes, on a mathematics paper with ten pages of LaTeX, and here everything works in milliseconds or less. And the scalability to monorepos and large files is not achieved using extra hacks, or extra layers, or LFS, or submodules, or whatnot; it's just a byproduct of our design. So yeah, these are the killer features. I don't know if I answered the question. I get the name Sanakirja for dictionary, but why the name Pijul? Oh, it's the name of a South American bird that builds its nest cooperatively: a group of birds builds a nest together, then all the females of the group lay their eggs in the same nest, and they take turns keeping the eggs warm. So yeah, it's just a metaphor. Okay, our time is up, so let's thank our speaker again. Thank you.