Can you hear me? I think so. It's working, but not in a loud kind of way. Anyway, I have a loud voice, so that's not a problem. So, I'm happy to be here. I was here four years ago, with everything that happened in between, and I gave a talk on the foreign memory API, which was an incubating API in Java 14, I think. So I'm happy to be here now to talk about the Foreign Function & Memory API, which is a finalized API in the upcoming Java 22 release.

So why did we do this API? The main reason is that the landscape around Java applications is changing rapidly. With the rise of machine learning, Java developers often need to do tasks that they didn't necessarily have to do before, such as talking to highly optimized linear algebra libraries that are not written in Java; they are written in C, C++, sometimes even Fortran. And sometimes the only way to reach those libraries is just to reach into native code directly. These libraries will not be ported to Java, most of the time because they keep changing: a new library pops up nearly every month with a new idea for offloading computation to the GPU.

So how do we talk to native libraries in Java? We do that with JNI. How many of you have used JNI in this room? OK, fair number. Good audience. With JNI, you can declare native methods. Native methods are like abstract methods, in the sense that they don't have a Java method body; their body is defined somewhere else, in a C or C++ file. And it can be C, C++, even assembly, if you like to play with it a little bit. JNI is flexible, but it has a few issues, in the sense that it is what we call a native-first programming model: it pretty much focuses on giving you access to Java functionality from the native side of the fence. When you write JNI, you quickly realize that you are basically shifting all your computation logic from the Java world to the native world, in order to minimize the number of transitions back and forth. And that can be a problem. There's also no idiomatic way to pass data to JNI. Yes, you can pass objects, but that has an overhead. So a lot of developers end up passing longs as opaque pointers, stored in some Java object. And that kind of works.

The problem with native functions, as I said, is that they never exist in isolation. They always have to manipulate some data, and this data is often off-heap, of course. And there are not very many libraries in the JDK that allow us to do off-heap memory access. One of them is the direct buffer API. You are probably familiar with direct buffers. They can be passed to native methods, and there are JNI functions that allow us, for example, to get the pointer backing a direct buffer, so that the JNI code can manipulate the buffer directly. One of the issues with direct buffers, perhaps the main one, is that there is no deterministic way to free or unmap a byte buffer. If you are done using your off-heap memory, you basically have to wait for the garbage collector to determine that the byte buffer is no longer reachable from your application. And that can have a latency cost. There is also a problem with the addressing space. The byte buffer API was born in the 1.4 days, quite a few years ago, and it only uses ints as offsets, which means the maximum addressable space is 2 gigabytes (minus one, yes). With the advent of persistent memory, these limits are starting to feel a little bit tight.
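[A minimal sketch of the direct-buffer pattern just described, showing the two pain points: int-based offsets that cap the addressable space, and reclamation that is entirely up to the garbage collector. The class name is illustrative.]

```java
import java.nio.ByteBuffer;

public class DirectBufferLimits {
    public static void main(String[] args) {
        // Capacity is an int, so a single buffer tops out at Integer.MAX_VALUE bytes (~2 GB).
        ByteBuffer point = ByteBuffer.allocateDirect(16);
        point.putDouble(0, 3d); // absolute addressing: explicit int offsets everywhere
        point.putDouble(8, 4d);
        // There is no free()/unmap(): the off-heap memory backing the buffer is
        // reclaimed only after the GC notices that the ByteBuffer is unreachable.
        point = null; // all we can do is drop the reference and wait
    }
}
```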
Also, there are not many addressing options provided by the buffer API. Either we go with the relative addressing scheme, where we basically say putInt, putInt, putInt, and rely on a mutable index inside the byte buffer to keep track of where we want to store the bytes; but that's slow, because we have to mutate some state, and the JIT optimizations have a little bit more trouble coping with that. Or we go fully explicit, and we put offsets everywhere in our code, and that makes our code a little bit more brittle.

So this is what happens when you want to access a native library. You have a client, you have a native library, and you have some JNI goop in the middle. What's inside the JNI goop? Well, a little bit of everything. There are some native method declarations in the Java code. Then you compile this code using javac with a special option, -h, which will generate the C header file that you need in order to implement your JNI functions. So you go over to C, you implement your JNI functions, and you compile that C file with your C compiler of choice. You get back a shim DLL. This DLL is not the library that you wanted to talk to in the first place; it is just some extra glue code that you need in order to get to the library that you want. So now you have two native libraries: the one you want to talk to, and the JNI DLL. And that's a little bit suboptimal.

So what we need instead is a Java-first programming model: something that allows us to reach into native functions directly, using only Java code. Since we want to model off-heap memory in a more sane way, we also need a replacement for the byte buffer API, something that is more targeted at the use cases that FFI has. We want deterministic deallocation. We want a bigger addressing space. We want better ways to describe struct layouts, so that we can access memory more easily. And we want to tie everything together: we want to define tools that allow us to automatically generate bindings for a native library in one shot. We'll see a little bit about that later. Ultimately, our goal is not to replace existing frameworks, such as JNA or JNR; I think Charlie is going to talk about that maybe later. It is to help some of those frameworks overcome the workarounds that they have to keep doing over and over again, because they don't have a proper API to deal with pointers, and they don't have a proper API to free pointers when they are no longer used. So hopefully some of this stuff is going to come in handy in those cases, too.

So Panama is not just about the Foreign Function & Memory API. Of course, that's a huge part of Panama. But Panama also contains the vector API, which is an API to access SIMD computation from Java code directly. And there's also Babylon, a project that recently sprung up, which allows us to see what's inside the body of a Java method, with a nice IR that can be introspected using Java. What can you do with Babylon? For example, you can take a Java method that contains a loop, inspect that loop, turn it into a GPU kernel, and then use FFM to dispatch that kernel to the GPU using CUDA. So Babylon and FFM come together and provide a better and more robust solution for doing GPU computing.

The main abstraction when it comes to accessing memory is called a memory segment, which gives us access to a contiguous region of memory. There are two kinds of memory segments; this is similar to byte buffers.
There are heap segments, which are backed by on-heap memory, and native segments, which are backed by off-heap memory. All segments have a size, so if you try to access a segment out of bounds, you get an error. They have a lifetime, which means they are alive, but after you free them they are no longer alive; if you try to access them when they are no longer alive, you get an exception. And some segments may also have confinement: they may start in a thread, and they can only be accessed in the same thread they started from.

How do we use segments? Well, it's not too difficult; it's very similar to byte buffers. You can almost see the mechanical translation from the byte buffer API to memory segments. Let's say that we want to model a point that has fields x and y. What we have to do is allocate a segment. We do that using an arena; we will see a little bit later what an arena is, so just go with me for a minute. We have to allocate a segment of 16 bytes, because the coordinates are 8 bytes each. Then we put double values into each coordinate, one at offset 0 and the other at offset 8. And that's how we populate the memory of that particular segment.

One of the issues with this code, of course, is that we are using an automatic arena. An automatic arena essentially provides an automatic deallocation scheme, which is similar to the one used by the byte buffer API. So we are not going to get any advantage here. But we can do one better. In fact, this is where we spent the most time designing the memory API. Java, as you all know, is based on the very idea of automatic memory management, which means you only care about allocating objects; the garbage collector will sit behind your back and automatically recycle memory when it is no longer used. This is based on the concept of computing which objects are reachable at any given point in time. The problem with this approach is that computing the reachability graph, that is, which objects are reachable at any given point in time, is a very expensive operation. And you will find that garbage collectors, especially those of the latest generation, the low-latency garbage collectors, don't want to materialize the reachability graph very often. If you try, for example, to allocate a lot of direct buffers using ZGC, you will see that a lot more time passes before a byte buffer is collected, compared to a solution where you can deterministically release the memory. So that's a problem. Another problem is that the garbage collector has no knowledge of the off-heap memory region that can be attached to a direct buffer. The only thing the garbage collector sees is a very small byte buffer instance, something like 16 bytes or a bit more. It doesn't see that maybe there are four gigabytes of off-heap memory attached to it, so there is no way to prioritize that collection. Also, the garbage collector can only keep track of an object as long as it is used from the Java application. If that byte buffer escapes to native code, it's up to the developer to keep that object alive across the native code boundary. You have to start playing with reachability fences, and your code suddenly doesn't look as good anymore. So what we need is a new way to think about managing memory resources explicitly.
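[A minimal sketch of the point example from a moment ago, written against the final Java 22 API; the class name is illustrative. The automatic arena gives the same GC-driven deallocation as a direct buffer.]

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;

public class PointExample {
    public static void main(String[] args) {
        Arena arena = Arena.ofAuto();              // automatic arena: memory freed by the GC
        MemorySegment point = arena.allocate(16);  // 16 bytes: two 8-byte coordinates
        point.set(JAVA_DOUBLE, 0, 3d);             // x at offset 0
        point.set(JAVA_DOUBLE, 8, 4d);             // y at offset 8
        // Access is bounds- and lifetime-checked:
        System.out.println(point.get(JAVA_DOUBLE, 0) + ", " + point.get(JAVA_DOUBLE, 8));
    }
}
```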
Managing memory explicitly is challenging, because we are sitting on top of a language that made its success on the very idea that you basically never worry about releasing memory, because the garbage collector will do it for you. So what we introduced is an abstraction called an arena. An arena models the life cycle of one or more memory segments. All the memory segments allocated with the same arena have the same lifetime. We call this a lifetime-centric approach: first you have to think about the lifetime of the memory that you want to work with, then you create an arena that embodies that lifetime, and then you start allocating memory.

There are many kinds of arenas. Of course, there is the simple global arena that you can use: whatever you allocate with it stays alive and is never collected. There's the automatic arena, which we saw before, which basically gives us an automatic memory management scheme, similar to byte buffers. But then there are the more interesting confined and shared arenas. These arenas support the AutoCloseable interface, so if you call close on such an arena, all the memory that has been allocated with it just goes away, deterministically. We don't need to wait for the garbage collector to do that. And there are strong safety guarantees: regardless of whether you are in the confined case or in the shared case, it is not possible for you to access a segment after it has been freed. In the shared case, we had to do a lot of JVM black magic to make this work. Because, of course, you might think: well, we just take a lock, and whenever you access a memory segment, we check whether the segment is still alive, using an expensive operation. And then you realize that memory access is 10x slower than before. So what we did instead, with the help of the GC team, is rely on a safepointing mechanism to make sure that it is never possible to close a segment while another thread is trying to access it. That works very well. Of course, it's a little bit more expensive if you need to close shared arenas very frequently, but hopefully you won't need to do that.

So what we are trying to do here is find a happy balance between the flexibility of C's deterministic memory management, where you do malloc and free explicitly, which is very flexible but also very unsafe, because you can have use-after-free and memory leaks, and the extreme safety of Rust, which comes at the expense of some flexibility when you write code, because if you want to build, for example, cyclic data structures in Rust, like a linked list, it becomes very, very difficult. Java is trying to sit in the middle, and I think we've done a good job here.

So how do you work with explicit arenas? It's basically the same as with automatic arenas. The only difference is that now we are using a try-with-resources statement. We create the arena inside the try-with-resources block, we do the allocation, we populate the point struct, and when we reach the closing brace, all the memory goes away. This is much better than the direct buffer counterpart, especially if you need to frequently allocate off-heap data structures, because we no longer put load on the garbage collector just to clean up off-heap memory.
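[A sketch of the explicit-arena version just described, using a confined arena and try-with-resources; the comment shows what happens on access after close.]

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;

public class ConfinedArenaExample {
    public static void main(String[] args) {
        MemorySegment point;
        try (Arena arena = Arena.ofConfined()) { // confined: accessible only from this thread
            point = arena.allocate(16);
            point.set(JAVA_DOUBLE, 0, 3d);
            point.set(JAVA_DOUBLE, 8, 4d);
        } // closing brace: everything allocated by the arena is freed, deterministically
        // point.get(JAVA_DOUBLE, 0) here would throw IllegalStateException: already closed
    }
}
```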
One thing that we still need to improve in this API is how we access the fields of the struct we want to operate with. In the example that I showed previously, we had to say: I want to access offset zero, I want to access offset eight, because we knew those were the offsets where the fields are. But what if we could just declare the layout of the struct we want to work with? What if we could translate the struct Point2D definition that we have in C into a Java object that models the same layout? Then we could start asking interesting questions, such as: what is the layout of the field x or y? Give me a VarHandle for accessing the x field. And that is exactly what we are doing here. So instead of just relegating the definition of Point2D to a comment, we actually define the layout of the point struct as a Java object. Then we use this object to derive two VarHandles, one for accessing the x field and one for accessing the y field. Inside the try-with-resources, we can just use the VarHandles to access the fields. We don't have to specify offset eight for the field y, for example, because the VarHandle encodes all the offset computation automatically. At the same time, look at the allocation expression, the very first one inside the try-with-resources block: we are just passing the layout to the allocation routine, and the layout, of course, knows the size of the block that we want to allocate.
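[A sketch of the layout-based version being described, against the final Java 22 API; class and constant names are illustrative.]

```java
import java.lang.foreign.*;
import java.lang.invoke.VarHandle;
import static java.lang.foreign.MemoryLayout.PathElement.groupElement;
import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;

public class LayoutExample {
    // The C definition is no longer just a comment:  struct Point2D { double x; double y; };
    static final StructLayout POINT_2D = MemoryLayout.structLayout(
            JAVA_DOUBLE.withName("x"),
            JAVA_DOUBLE.withName("y")).withName("Point2D");

    // VarHandles derived from the layout; each one encodes its field's offset computation
    static final VarHandle X = POINT_2D.varHandle(groupElement("x"));
    static final VarHandle Y = POINT_2D.varHandle(groupElement("y"));

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment point = arena.allocate(POINT_2D); // size comes from the layout
            X.set(point, 0L, 3d); // the trailing 0L is a base offset; no hand-written 0 and 8
            Y.set(point, 0L, 4d);
            System.out.println(X.get(point, 0L) + ", " + Y.get(point, 0L));
        }
    }
}
```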
So, switching gears a little bit, let's start talking about FFI. The main abstraction in FFI is called the native linker. This is an object that essentially embodies the calling convention of the platform on which the JVM runs. It provides two capabilities. The first is that it allows us to derive a method handle that targets a native function: we basically describe the native function we want to call, get a method handle, and just call it from Java. The second capability is kind of the reverse: we have a method handle that describes some Java computation, and we want to turn it into a function pointer, that is, a memory segment, that we can then pass back to native code. This approach is inspired by, for example, Python's ctypes, or libffi; those are the main inspirations. We want to be able to describe a function from Java, so that we can call it directly. It all builds on the abstractions that we've seen so far: we use layouts to describe the signatures of C functions, we use memory segments to pass addresses or structs, and we use arenas to model the life cycles of upcall stubs, and also the life cycles of loaded libraries.

So let's call a native function. Here I define a function, distance, that takes a point and returns the distance of the point from the origin. Actually, doing that in C is a little more convoluted than it looks, because it essentially depends on the platform we are on. If we are on Linux, we have to look at a set of rules called the SysV calling convention, which tells us, for example, that structs of the size of our Point2D struct can have their fields passed in registers. So the only thing we need to do when calling the distance function is load the first floating-point register with the value 3 and the second floating-point register with the value 4, and then we just jump to the function.

But if you are on Windows, even on x64, there is a completely different calling convention, which tells us that any struct bigger than 64 bits, such as our struct here, is passed in memory instead. The struct has to be spilled onto the stack, a pointer to the stack has to be stored in the RCX register, and then we jump to the function. Same function, same architecture (it's x64 in both cases), but a completely different set of assembly instructions needs to be generated to act as a trampoline from Java code to C code. That's why it's important that we are able to describe the signature of the C function to the linker: the linker will inspect that signature and determine the exact set of instructions needed to go from the Java code to the native code underneath.

So how do we do this? When we call downcallHandle on the native linker, we pass, of course, the address of the function that we want to call. This is obtained using a symbol lookup, which we won't have time to investigate in further detail, but it basically gives us the address where the distance function lives. And then we provide a function descriptor. The function descriptor is nothing but a set of layouts, one for the return type and one for each argument. In this case, we know that the return type is double, so we use a double layout, and the argument is the Point2D struct that we defined before, so the same layout can be reused to describe the signature of the function. Then, inside our try-with-resources, we populate the point as before, and we can call the method handle: we just pass the point memory segment to the method handle we obtained, and the point is passed by value to the C function. Nothing else needs to be done, because the linker figures out exactly what machine instructions to generate in order to get there.
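[A sketch of the downcall just described, assuming a hypothetical library libpoint.so that exports double distance(struct Point2D); the library and class names are made up.]

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;

public class DistanceCall {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        try (Arena arena = Arena.ofConfined()) {
            // Locate the native function; the library lives as long as the arena.
            SymbolLookup lib = SymbolLookup.libraryLookup("libpoint.so", arena); // restricted method
            MemorySegment addr = lib.find("distance").orElseThrow();

            StructLayout point2d = MemoryLayout.structLayout(
                    JAVA_DOUBLE.withName("x"), JAVA_DOUBLE.withName("y"));
            // Descriptor: returns double, takes struct Point2D by value.
            MethodHandle distance = linker.downcallHandle(
                    addr, FunctionDescriptor.of(JAVA_DOUBLE, point2d));

            MemorySegment point = arena.allocate(point2d);
            point.set(JAVA_DOUBLE, 0, 3d);
            point.set(JAVA_DOUBLE, 8, 4d);
            double d = (double) distance.invokeExact(point); // linker emits the right trampoline
            System.out.println(d); // 5.0 for a correct distance()
        }
    }
}
```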
So of course, when we talk about native functions, we always have to keep safety in the back of our mind, because whenever we go native, the operation is fundamentally unsafe. We could, for example, make a mistake in describing the signature of our target C function, which means the assembly stub we generate is not correct for calling that particular function, and we may cause all sorts of issues. The foreign code may attempt to free memory that has already been freed from Java code. Or we may get a pointer from native code and try to resize it, but get the size wrong, so we end up trying to access memory that is not there. So in the FFM API there is a concept called restricted methods. Some methods in the FFM API are not directly available all the time. They are part of the Java API, so if you go to the Javadoc you can see them, but they are restricted, and you need to use an extra command-line flag if you want to use them without warnings. For now, if you use a restricted method, such as the method for creating a downcall method handle, you will only get a warning. But in the future, we plan to turn this warning into an error, and in that case you will have to use a new option called --enable-native-access, which grants a subset of the modules of your application, or the ALL-UNNAMED module if you are using the class path, access to restricted methods.

This is part of a bigger plan to move Java onto a more solid foundation, one that allows us to provide integrity by default. Java in its default configuration should always preserve integrity, which means it shouldn't be possible for native code to mess with invariants by, for example, mutating final fields and things like that.

So this is the workflow when using the FFM API to access a native library. We still have something in the middle between us and the native library that we want to call. This time, though, the stuff in the middle is just Java objects: memory layouts, VarHandles, method handles, function descriptors. But here's an idea: what if we could generate all this stuff mechanically, using a tool? That's exactly what the jextract tool does. Let's say we want to call the qsort function, which is actually a tricky function, because it takes a function pointer that is used to compare the elements of the array being sorted; it uses a function pointer typedef. If you want to model this using plain FFM, it's going to take you a little bit of setup code to create the upcall stub and the method handles required to call it. But if you give the header where this is defined, the standard library header, to jextract, you basically get back a bunch of static declarations that you can use to call qsort. If I do that, the only thing I have to do in my code is create the function pointer, and this is possible with a factory generated by jextract that lets me pass a lambda expression. The lambda expression is turned into a function pointer stored inside a memory segment, which I can then pass to the qsort function. And qsort is not a method handle anymore; it's a nice static wrapper around the method handle. That's much better from the developer's perspective, because using method handles directly can be tricky: you can pass the wrong type, and things blow up at runtime.

In comparison, this is the code you would have to write to do the same with JNI: Java code with native methods, another file generated by javac, and then quite a bit of C implementation for qsort. It actually took us a few attempts to arrive at the optimal implementation, because our first attempt wasn't very good; it can get quite tricky. Even better, if you look at the performance, the plain FFM-based approach is roughly 2x to 3x faster than the JNI approach, even an optimized JNI approach. That's because a colleague of mine, Jorn Vernee, has put a lot of effort into optimizing the upcall path in particular. When you want to call a Java function from native code, there was a lot of performance left on the table by JNI, and we were able to greatly improve performance there. For regular downcalls, you probably won't see much difference; FFM is more or less on par with JNI. But as soon as your native call starts upcalling back into Java, you are going to see massive differences.
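[For comparison with what jextract generates, a hand-written FFM sketch of a qsort binding with its comparator upcall. This is my reconstruction, not the talk's slide code; run with --enable-native-access, or expect restricted-method warnings.]

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import static java.lang.foreign.ValueLayout.*;

public class QsortDemo {
    // comparator that native code will upcall into
    static int compareInts(MemorySegment a, MemorySegment b) {
        return Integer.compare(a.get(JAVA_INT, 0), b.get(JAVA_INT, 0));
    }

    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // void qsort(void* base, size_t n, size_t size, int (*cmp)(const void*, const void*))
        // size_t modeled as JAVA_LONG on 64-bit platforms
        MethodHandle qsort = linker.downcallHandle(
                linker.defaultLookup().find("qsort").orElseThrow(),
                FunctionDescriptor.ofVoid(ADDRESS, JAVA_LONG, JAVA_LONG, ADDRESS));

        // the comparator receives two pointers; resize them so an int can be read through each
        FunctionDescriptor cmpDesc = FunctionDescriptor.of(JAVA_INT,
                ADDRESS.withTargetLayout(JAVA_INT), ADDRESS.withTargetLayout(JAVA_INT));
        MethodHandle cmp = MethodHandles.lookup().findStatic(QsortDemo.class, "compareInts",
                MethodType.methodType(int.class, MemorySegment.class, MemorySegment.class));

        try (Arena arena = Arena.ofConfined()) {
            // function pointer wrapping the Java comparator; freed when the arena closes
            MemorySegment cmpStub = linker.upcallStub(cmp, cmpDesc, arena);
            MemorySegment ints = arena.allocateFrom(JAVA_INT, 5, 3, 1, 4, 2);
            qsort.invokeExact(ints, 5L, 4L, cmpStub);
            System.out.println(java.util.Arrays.toString(ints.toArray(JAVA_INT))); // [1, 2, 3, 4, 5]
        }
    }
}
```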
So, wrapping up: FFM provides a safe and efficient way to access memory. We have deterministic deallocation. We have layouts to describe structs, which give us the ability to describe the contents of the memory we want to work with, and then get VarHandles to access that memory in a much more robust way. Then we have an API to access native functions directly from Java, so there is no need to write JNI code. That means your deployment gets simpler, because you don't have that shim DLL going around that you need to distribute along with your application. And together, the foreign linker, memory segments, and layouts provide the foundation of a new interop story for Java, based on a tool called jextract, which allows us to target native libraries directly.

One thing that emerged while we were working on FFM is that there were quite a lot of use cases that we didn't anticipate at first. Since FFM is a fairly low-level library, it very easily allows other languages built on top of the JVM, such as Scala, Clojure, or even Ruby, to use the FFM layer to target native functions. That was very expensive to do with JNI, because it meant the other language sitting on top of the JVM needed to ship some JNI code to be able to do that, or maybe use a library like libffi. With FFM, this is possible directly out of the box, and I think that's a good improvement.

We have been incubating and previewing for a long time, since JDK 14 essentially, and that allowed us to get a lot of feedback from Apache Lucene, Netty, Tomcat. I think today they are in production with some of this stuff. If you run Lucene on Java 21, you are getting a code path that uses FFM under the hood, and I think that helped them get rid of some of the issues where they had to use Unsafe to free memory that was mapped, because otherwise waiting for the garbage collector could lead to other issues. We are also being used by TornadoVM. That's an interesting case, where memory segments are used to model memory that lives on the GPU, so they are using memory segments in a very creative way there. And a bunch of other projects have chimed in as well. For us, it was a very successful experience with preview features, because it allowed us to gather a lot of feedback. We don't necessarily have a lot of knowledge on all these topics within the JDK team, so it was good for us to put something out, have people use some of this stuff, and make it better.

That's the end of my talk. These are some links. I hope you are going to try FFM in 22. You can subscribe to the mailing list and send us feedback. There is a link to the jextract tool; binary snapshots are available, so you can grab the latest one, start extracting your library of choice, and play with it a little bit. And there is a link to the repos. That's mostly it. Thank you very much.

Questions? [Audience question, partly inaudible: what is the difference between FFM and Kotlin Native?] Yeah, basically the question is what the difference is between this and Kotlin Native, since Kotlin Native can provide access to off-heap memory and native functions as well. I think they are very similar. One thing that I think Kotlin Native cannot do, because it's still sitting on top of the VM and has to play by the rules of the existing libraries, is have a solution for releasing memory safely. I believe Kotlin Native at some point has to say: if you use a pointer, your code is going to be unsafe, and if you try to free a pointer, all bets are off. That is the main difference: with memory segments, you can close an arena and your code will never crash. You may get an exception.
[Audience follow-up, partly inaudible.] Yeah, but you know, with the APIs I've seen so far, there is always a hole: if you use them correctly it works, but there are ways to use them from multiple threads where it's not working, unless you go deeper, at the VM level, which of course Kotlin Native cannot do.

[Next question.] Do you know how many platform-specific hacks need to be done? If I want to use one piece of code on, say, ARM macOS and Linux RISC-V, is it fully one code for all platforms? So the question was: do we need to worry about differences between platforms? The answer is yes, in the sense that the jextract tool is going to give you a binding for the platform you are running on. Now, this sounds scary. In practice, if you work with a high-level library such as libclang, for example, we do a single run of jextract and then reuse the output across all the platforms, and it works fine, because that library is defined in a way that is portable. If you work with system libraries, of course, you are going to have a lot less luck: that system library is only going to work on one platform, and the other platforms will need to do something else.

[Next question.] Can you tell us about the memory footprint compared to JNI? So, of course, if you use a memory segment, there is a little bit of footprint, because you have an object that embeds an address, rather than just a long. But our plan is to make all these memory segments scalarizable, because the implementation is completely hidden: you only have a sealed interface in the API, which means all these interfaces are going to be implemented by value classes when Valhalla comes. So when you wrap a memory segment around an address, you are not going to pay anything allocation-wise. For now, there is a little bit of cost in the cases where the VM cannot figure out the allocation with escape analysis, but in the future we plan for this to disappear completely. Yeah, okay. Sorry.