Okay. So we welcome Antoine Delpache, if I'm correct.
And yeah, Florey Searst.
Thank you. So I'm Antoine Delpache.
I'm a developer on the Open Refine project.
And I'm very happy to be back in this bedroom to tell you,
give you a few news about Open Refine.
And in particular, I'm going to be focusing on what I'm working on right now,
together with Zoe Cooper, who's a designer on the project,
to make Open Refine more reproducible.
So I will first explain to you what Open Refine is,
because I'm not assuming everyone was here four years ago.
And if you were, don't worry, there are some differences
that you might be able to spot.
I'm very keen to know if those differences look good to you.
And also, what do I mean with reproducible in this context?
So what is Open Refine?
It's a data cleaning tool.
So you can import tabular data, mostly, in it.
And then it lets you do all sorts of cleaning operations on it.
Guess what?
So let me give you an example.
So this is a database of filming locations in Paris.
So every time you film something in Paris,
you need to register it with the city, and then they make this data set.
And one thing I can do here is to say, let's match all of those films
with an external database.
And we call that reconciliation.
So in this example, I'm going to reconcile it with WIC data
that we've already heard about earlier today.
And because reconciliation is a bit of a tricky process,
we have various options to let you configure how we're going
to match your data to WIC data so that we just don't only rely on the names,
but also on other attributes that we have in this data set.
And we then have various tools to help you make that a little bit efficient
and let you review the results of the reconciliation manually.
So for instance, here I can hover this and get a link to the WIC data item
that it could link to.
So that's a sample of one type of operation that people do a lot with
OpenRefine.
You can then manually match things if you want to go through the entire
data set yourself.
Let me show you something else.
Well, first, once you've done this reconciliation,
you can pull some data from the target database.
In this example, I could, for instance, do something quite simple.
Sorry.
Let's just add a new column with the URLs of those entities
in the database.
So that's something that I can do quite quickly.
And you get your new column.
You could also pull more information from WIC data, identifiers in other
databases, things like that.
Let me show you another sort of operation you can do in OpenRefine.
This is the column with the directors of those films.
And I can try to cluster them.
So what does it mean?
Well, we are going to basically look through all sorts of values in this
data set and try to detect whether they might refer to the same entity.
And when that's the case, then you often want to normalize those to one
consistent spelling.
That's very useful, typically, as a first step for reconciliation.
So those are samples of the canonical values you could use.
So let's say I want to use all of those suggestions and accept them as
valid clusters.
OK.
So those are the sorts of things you can do in OpenRefine.
Now, what do I mean by making this tool more reproducible?
So imagine you're a researcher working on some data that you've collected.
You're cleaning it with OpenRefine as part of your research process.
And at the end, you want to publish a paper about what you did and you want
to make your research process transparent.
So you want your fellow researchers to be able to inspect what you've done
in OpenRefine and ideally even reproduce it on a similar version of the data set.
So what can we do for now?
The best thing we have for this so far is our undo-redu tab.
And as you can imagine, it's primarily designed for undoing things that you've
done, but it also happens to list all of the operations you've done so far with
OpenRefine.
So you could try and copy and paste this in your research article as a way of
saying, this is what I did.
Now, this is not exactly ideal.
So we are working on improving basically this part of the tool.
And before we get into reproducibility per se,
there's already a lot of usability issues with this interface.
And that's where it's been very interesting to work with a designer on this
project who was also not familiar with the tool before she came on board.
And so she was really able to come with a fresh eye and identify things that I
really couldn't see anymore because I've been looking at this for so many years
already.
So for instance, here, it might not be clear to everyone that you can actually
click on those previous steps to go back to them.
We don't have any undo button in OpenRefine.
We only have this weird undo, redo tab where you can't really click on the undo
or the redo, like things like this.
And so it's been really eye-opening.
What else can you not do?
Well, say I realized that this match here was wrong and I want to undo just this
operation, but I want to keep all of the following ones.
There's no good workflow to do that, but it's very often requested.
So let me now show you what we can do with those extract and apply buttons here.
So I'm going to roll back here.
And if I click extract, I get this interface where I can select some operations
I'm interested in and then I get some code for them.
And this big blob of JSON is something I can copy and share as the representation
of those operations.
And I can also reapply them later on on this project or another one.
Now, the problem with this is that it's very hard to work with this representation.
It's very unreadable.
And it's also very brittle.
So for instance, if the column names of your new data set do not exactly match
the column in the original data set, you will have horrible errors and it will be very
hard to do anything with those operations.
So that's the core of what we're trying to solve, providing a better representation
for those operations so that you can understand what they are and also reapply them reliably.
So as a summary of the main goals of this project, make the basic undo-redu functionality
just more usable.
Then make this reproducibility also easier and effective because we want those representations
of operations to be reliably applicable.
And also adding this advanced undo functionality of undoing not just the latest operation,
or maybe just modifying the parameters of an earlier operation.
So that's the main goals.
And what do we have so far?
Well, you might have already noticed some differences in this prototype, but let me show you another
one.
So far I've been working on making open-refined operations aware of which parts of the data
set they modify.
Because the problem is, if you want to let people undo a deep operation, then you need
to be able to detect which following ones can be kept or not, or if they need to be
recomputed because the data they were working on has been touched.
So now that we have this capability of scoping operations a little bit better, you can, for
instance, run reconciliation on multiple columns and that will run concurrently, which is something
that wasn't possible before.
So you see the reconciliation I started earlier, it's only 7% complete.
It's a very slow operation because which data reconciliation is particularly slow.
And now I can already start reconcealing the other column.
And if you see, we already get some results, although the first one hasn't completed yet.
So that's already won win.
It's not directly about reproducibility, but I hope this will be work on by users because
it should save people a lot of time.
And on top of that, we've done some research about how other tools represent pipelines or
their undo-redu functionality.
So this is a screenshot from Talent, another data cleaning tool that we've been looking
at.
And in those sorts of data cleaning tools, you design your pipeline explicitly on a canvas.
So it's a very different sort of user experience.
But we've also been looking at Excel, how they let you track changes, or basic undo-redu
functionality in Google Sheets, things like that.
So that's been also very interesting in trying to get some sort of user experience that our
users are already familiar with.
So as you can see, this is all work in progress.
This is what I have just here, a prototype.
We don't have full answers to all of those questions yet.
But we're working on this, and we are very keen to hear from you.
So if you're interested in those topics and would be happy to test out some ideas with
us, we're running some user testing sessions.
So you're very welcome to sign up for those.
And that's basically the state of the project.
And I also have some open refined stickers if you happen to organize some training events
in various places.
So do also get back to me if you want some.
Thank you.
Thank you.
We can maybe take one question.
Thank you for the presentation.
So it's an interesting piece of software.
But what exactly is the target audience?
Because I mean at some point, if you have the data rendering script, it makes the
job.
I mean, to not get me wrong, it's interesting.
But just to know who exactly you are targeting.
So the question is, what is the target audience of open refined?
So it's a broad range of communities.
I would say it's generally suited for tasks where you can't really just write a script
upfront, which will do your keening.
And it's not really about whether you like programming or not.
It's just some tasks where you need to be looking at the data while you're doing the
cleaning.
As you saw reconciliation, it's a messy thing.
You can't really just come up with the parameters and make the matching.
You need to be looking at the data.
Same for clustering.
So it's a mixture of interactive data cleaning and a little bit more automation that you
would have in Excel.
So basically here the point is the point is the point click aspect for the operations.
So the real point is the point click aspect for the user.
Let's thank Antonina again.