Published in Blog
Java versus Python performance benchmarks on PlanningAI models
Discover the techniques Timefold Engineers deploy to make your Python code faster.
Performance is a top priority for Timefold Solver, since any performance gain means we can find the same solution in less time (or, alternatively, a better solution in the same amount of time). This is also true for Python, which is typically one to two orders of magnitude slower than Java. Through a variety of techniques, we are able to make Python reach 25% of Java’s throughput. Read on to find out the methods we deploy to improve Python’s performance, and see how Timefold Solver for Python compares to Java on a variety of PlanningAI models.
Note: Usage of Python in this article refers exclusively to the CPython implementation.
Why is Python relatively slow?
Python’s performance challenges are mostly due to design decisions:
Global Interpreter Lock
The Global Interpreter Lock, or GIL for short, is a lock that prevents multiple threads from executing Python code at the same time.
This is because Python’s core data structures, such as dict, list, and object, are not thread-safe.
When the GIL was introduced (Python 1.5, in the year 1998), multi-threaded programs were rare (most computers only supported one thread at a time), so using a GIL simplified the experience for Python extension developers without degrading single-threaded performance. The GIL was a good idea for its time, but it gradually became a liability as multi-threaded programs and computers became more common. Currently, Python is working on removing the GIL, but it will be several releases before that work is widely available.
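The effect is easy to observe with a small sketch (the counting workload below is hypothetical, and exact timings vary by machine): on a GIL build, running CPU-bound Python code on two threads takes about as long as running it twice sequentially, because only one thread executes Python bytecode at a time.

```python
# A hypothetical CPU-bound workload (illustration only; timings vary by machine).
import threading
import time

def count(n):
    total = 0
    for i in range(n):
        total += i
    return total

N = 2_000_000
results = [None, None]

def worker(index):
    results[index] = count(N)

# Run the workload on two threads...
start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# ...and twice sequentially, for comparison.
start = time.perf_counter()
sequential_results = [count(N), count(N)]
sequential = time.perf_counter() - start

# On a GIL build, the threaded run takes roughly as long as the sequential one,
# because only one thread executes Python bytecode at a time.
print(f"threaded:   {threaded:.3f}s")
print(f"sequential: {sequential:.3f}s")
```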
Dynamically Typed
Python is dynamically typed, which means the result type of an operation cannot be determined before its execution. Moreover, many operations such as addition and attribute lookup are themselves dynamic, since they look up attributes on their operands' types. As a result, this seemingly simple Python code:
def add(x, y):
    return x + y
does something far more complex underneath:
def ignore_type_error_if_builtin_type(arg_type, method, x, y):
    try:
        return method(x, y)
    except TypeError:
        if arg_type in [int, str, list, ...]:
            return NotImplemented
        raise

def add(x, y):
    x_type = type(x)
    x_add = x_type.__dict__.get('__add__')
    y_type = type(y)
    y_radd = y_type.__dict__.get('__radd__')
    if x_add is None and y_radd is None:
        raise TypeError('...')
    elif y_radd is None:
        out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
        if out is NotImplemented:
            raise TypeError('...')
        return out
    elif x_add is None:
        out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
        if out is NotImplemented:
            raise TypeError('...')
        return out
    else:
        if x_type != y_type and issubclass(y_type, x_type):
            out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
            if out is NotImplemented:
                out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
                if out is NotImplemented:
                    raise TypeError('...')
        else:
            out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
            if out is NotImplemented:
                out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
                if out is NotImplemented:
                    raise TypeError('...')
        return out
For custom types, this complexity causes addition to have about double the overhead of a direct method call:
import timeit

source = \
'''
class A:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return self.value + other.value

a = A(0)
b = A(1)
'''

timeit.timeit('a + b', source)
# 0.04903000299964333
timeit.timeit('add(a, b)', f'{source}\nadd=A.__add__')
# 0.029219600999567774
However, Python’s specializing adaptive interpreter speeds up the code for builtin types, such as int:
import timeit
timeit.timeit('a + b', 'a=0\nb=1')
# 0.009246169998732512
timeit.timeit('add(a, b)', 'a=0\nb=1\nadd=int.__add__')
# 0.04777865600044606
Monkey-Patching
Python supports monkey-patching; that is, the behavior of types and functions can be modified at runtime. For example, you can change the implementation of a type’s method:
class Dog:
    def talk(self):
        return 'Woof'

a = Dog()
print(a.talk())  # Prints 'Woof'

Dog.talk = lambda self: 'Meow'
print(a.talk())  # Prints 'Meow'
Monkey-patching prevents method inlining, an optimization technique where a method is copied into its call sites, removing call overhead and allowing further optimization to be done.
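Why inlining is unsafe here can be sketched in pure Python: caching a method lookup (a stand-in for what an inlining compiler would bake into a call site) silently misses a later monkey-patch.

```python
class Dog:
    def talk(self):
        return 'Woof'

a = Dog()
cached_talk = a.talk            # resolve the method once, as inlining would

Dog.talk = lambda self: 'Meow'  # monkey-patch after the call site was "inlined"

print(a.talk())       # 'Meow' -- a fresh dynamic lookup sees the patch
print(cached_talk())  # 'Woof' -- the cached ("inlined") call is now stale
```

An inlining compiler that froze the original lookup would return the stale answer, which is why CPython must re-resolve the method on every call.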
Lack of a JIT compiler
Python currently lacks a JIT compiler, meaning it cannot optimize code based on runtime characteristics (such as how often a branch is taken). This will probably change in the future, as an experimental JIT compiler for Python is currently being worked on.
However, all of the above is insignificant compared to our biggest performance obstacle…
Proxies are slow
A proxy is a class that acts as a facade for arbitrary classes. In order for Java to call a Python method (or vice versa), a Foreign Function Interface must be used. A Foreign Function Interface, or FFI in short, is an interface that allows one program language to call functions of another. In Java, there are currently two different FFI implementations:
- Java Native Interface (JNI)
- (Java 22 and above) Foreign Function and Memory API (foreign)
JNI and the foreign API both return a raw pointer for Python objects (in JNI, a long, and in foreign, a MemorySegment). However, a pointer has little use in Java code, which expects interfaces like Comparable or Function. Typically, proxies are used to convert the pointer into Comparable, Function, or some other interface.
Why Proxies are slow
If a proxy is empty, it is only about 1.63 times slower than a direct call. However, proxies are often not empty: they frequently need to perform complex tasks such as converting the arguments and results of methods.
How slow are we talking? Consider this simplistic benchmark:
public class Benchmark {
    public static void benchmark(Runnable runner) {
        var start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            runner.run();
        }
        var end = System.nanoTime();
        System.out.printf("%.9f\n", (end - start) / 1_000_000_000d);
    }

    public static void main(String[] args) {
        System.out.println("Cold");
        benchmark(() -> {});
        System.out.println("Hot");
        benchmark(() -> {});
    }
}
We can invoke it via JPype:
import jpype

jpype.startJVM(classpath=['.'])
benchmark = jpype.JClass('Benchmark')

@jpype.JImplements('java.lang.Runnable')
class Runner:
    @jpype.JOverride
    def run(self):
        pass

print("Cold")
benchmark.benchmark(Runner())
print("Hot")
benchmark.benchmark(Runner())
or GraalPy:
import java.Benchmark

def run():
    pass

print("Cold")
java.Benchmark.benchmark(run)
print("Hot")
java.Benchmark.benchmark(run)
Here are the runtimes for running benchmark with an empty Runnable in Java versus an empty Runnable implemented as a Python proxy:
Method | Runtime (seconds)
---|---
Java Direct Call (Cold) | 0.000232610
Java Direct Call (Hot) | 0.000240525
Java FFI Direct Call (Cold) | 0.005904000
Java FFI Direct Call (Hot) | 0.000442477
JPype Proxy (Cold) | 0.006906666
JPype Proxy (Hot) | 0.004721682
GraalPy Proxy (Cold) | 0.027417475
GraalPy Proxy (Hot) | 0.000883732
Despite the Python code literally doing nothing, a JPype proxy is 967.102% slower than a direct FFI call when hot. GraalPy fares better, at only about 100% slower.
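Working the hot rows of the table above (a quick check, taking the Java FFI direct call as the baseline for a direct call into Python):

```python
# Hot runtimes from the table above, in seconds.
ffi_direct = 0.000442477     # Java FFI Direct Call (Hot)
jpype_proxy = 0.004721682    # JPype Proxy (Hot)
graalpy_proxy = 0.000883732  # GraalPy Proxy (Hot)

def percent_slower(measured, baseline):
    return (measured / baseline - 1) * 100

print(f"JPype proxy:   {percent_slower(jpype_proxy, ffi_direct):.3f}% slower")    # 967.102
print(f"GraalPy proxy: {percent_slower(graalpy_proxy, ffi_direct):.3f}% slower")  # 99.724
```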
Things get even worse when there are arguments and return types involved (for example, with java.util.function.Predicate). Here, we pass a Python list to Java, which calls a Python proxy function on each item in the list.
import jpype

jpype.startJVM(classpath=['.'])
benchmark = jpype.JClass('Benchmark')

class Item:
    def __init__(self, value):
        self.value = value

@jpype.JImplements('java.util.function.Predicate')
class Runner:
    @jpype.JOverride
    def test(self, item):
        return item.value

def proxy(item):
    return jpype.JProxy(jpype.JClass('java.io.Serializable'), inst=item, convert=True)

arraylist = jpype.JClass('java.util.ArrayList')
items = arraylist([proxy(Item(i % 2 == 0)) for i in range(10000)])

print('Cold')
benchmark.benchmark(Runner(), items)
print('Hot')
benchmark.benchmark(Runner(), items)
or GraalPy:
import java.Benchmark

class Item:
    def __init__(self, value):
        self.value = value

def test(item):
    return item.value

items = [Item(i % 2 == 0) for i in range(10000)]

print("Cold")
java.Benchmark.benchmark(test, items)
print("Hot")
java.Benchmark.benchmark(test, items)
Method | Runtime (seconds)
---|---
Java Direct Call (Cold) | 0.000819039
Java Direct Call (Hot) | 0.000559228
Java FFI Direct Call (Cold) | 0.004265137
Java FFI Direct Call (Hot) | 0.000807377
JPype Proxy (Cold) | 0.015507570
JPype Proxy (Hot) | 0.012130092
GraalPy Proxy (Cold) | 0.070494332
GraalPy Proxy (Hot) | 0.004472111
Here, the JPype proxy is 1402.41% slower than a direct FFI call when hot! GraalPy also degrades, becoming 453.906% slower than a direct FFI call.
In particular, when the arguments are passed to Java, they are converted to Java proxies. However, when the arguments are passed back to Python, they are not converted back to Python objects. Instead, a Python proxy is created for the Java proxy, creating additional overhead for every single operation.
Why using FFI directly is hard
Given the problems with proxies, the obvious solution is to just manage the opaque Python pointers and do direct Python calls. Unfortunately, using FFI directly is hard:
- In order for garbage collection to work, the Java and Python garbage collectors must be aware of each other (especially if Java objects are passed to Python). In particular, if an arena is closed, it is zeroed out (to prevent use-after-free). This means that if Python is still tracking the object (because it retains a reference), a segmentation fault will occur on the next garbage collection. However, if an arena is never closed, we will eventually run out of memory. This is especially true for long-running processes, such as solving. JPype solves this with deep magic that links Java’s and Python’s garbage collectors. Unfortunately, calling Python functions directly using FFI is unsupported in JPype.
- Java FFI is designed around starting from Java, not from Python (i.e. Java host). As a result, without deep magic like JPype’s, a separate Java process must be launched to use FFI with an independent Python interpreter.
- Java objects cannot be used as Python objects directly, meaning some proxies will need to be generated if a Java object is passed to Python code.
Avoiding all the problems (and replacing them with new ones)
So how can we avoid the performance losses from:

- the Global Interpreter Lock
- dynamically typed code
- monkey-patching support
- the lack of a JIT compiler
- proxy overhead

with a single solution? The (not so simple) answer is to translate Python’s bytecode and objects to Java bytecode and objects.
- There is no Global Interpreter Lock, as Java doesn’t have one.
- We can treat type annotations as strict, meaning we can determine which method to call at compile time in some cases.
- We can remove monkey-patching support; by the time the code is compiled, typically all monkey patching (if any) is already done.
- Java has a JIT compiler.
- Because the Python objects are translated to Java objects, no proxies are involved.
However, there is a not-so-simple part:

- Unlike Java, Python’s bytecode changes drastically from version to version. Even the behavior of existing opcodes may change (i.e. Python bytecode is neither forwards nor backwards compatible).
- Python’s bytecode has several quirks that make it non-trivial to map directly to Java bytecode. For instance, Python keeps its for loop’s iterator on the stack for the entire loop body. As a result, when an exception is caught in a for loop, Python bytecode expects there to still be an iterator on the stack (whereas in Java, the exception handler’s stack is always empty).
- The Java object result needs to be converted back to a Python object.
Annoying as these problems are, they are significantly easier to debug and test than using FFI directly; we can rely on JPype’s deep magic to keep Java’s and Python’s garbage collectors in sync, and we can have deterministic reproducers and test cases.
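The for-loop quirk can be observed with the standard dis module (a sketch; exact opcode names vary by CPython version, per the compatibility caveat above). CPython emits GET_ITER to push the iterator and FOR_ITER to advance it, and that iterator stays on the evaluation stack for the whole loop body:

```python
import dis

def loop(items):
    total = 0
    for x in items:
        total += x
    return total

# Collect the opcode names of the compiled function.
names = [instruction.opname for instruction in dis.Bytecode(loop)]
print(names)

# GET_ITER pushes the iterator; FOR_ITER advances it on each iteration
# while the iterator remains on the stack, unlike anything idiomatic
# Java bytecode would produce.
```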
How does the translated bytecode perform?
Is it truly worth all the hassle of translating Python’s bytecode? The only way to know is to measure the speed before and after the bytecode translation. Bytecode translation was originally implemented in OptaPy (of which Timefold Solver for Python is a fork), which included an option to use proxies instead. As the bytecode generation matured, the option to use proxies was removed from Timefold Solver for Python.
Thus, to compare the performance of proxies versus translated bytecode, we will use the latest (and probably last) release of OptaPy (9.37.0b0) on Python 3.10.
Here are the score calculation speeds of using proxies versus translated bytecode in OptaPy’s school timetabling quickstart (higher is better).
Note: OptaPy’s School Timetabling quickstart is not equivalent to Timefold’s School Timetabling quickstart: there are bugs in the constraints that were fixed in the Timefold version but never addressed in OptaPy’s version.
Method | Score Calculation Speed
---|---
Proxies | 599/sec
Translated Bytecode | 69558/sec
In short, using translated bytecode multiplied our throughput by more than 100 (an over 10,000% performance increase)!
So… how fast is Timefold Solver for Python?
Now finally to the topic of the blog post: how fast is Timefold Solver for Python on various PlanningAI models? Below are the move evaluation speeds of the Java and Python Timefold 1.15.0 quickstarts:
Quickstart | Dataset | Language | Move Evaluation Speed
---|---|---|---
School Timetabling | SMALL | Java 22.0.1 | 230143/sec
School Timetabling | SMALL | Python 3.12 | 64034/sec
School Timetabling | LARGE | Java 22.0.1 | 174881/sec
School Timetabling | LARGE | Python 3.12 | 42249/sec
Employee Scheduling | SMALL | Java 22.0.1 | 85926/sec
Employee Scheduling | SMALL | Python 3.12 | 27462/sec
Employee Scheduling | LARGE | Java 22.0.1 | 53265/sec
Employee Scheduling | LARGE | Python 3.12 | 15753/sec
Vehicle Routing | PHILADELPHIA | Java 22.0.1 | 330492/sec
Vehicle Routing | PHILADELPHIA | Python 3.12 | 18912/sec
Vehicle Routing | HARTFORT | Java 22.0.1 | 337094/sec
Vehicle Routing | HARTFORT | Python 3.12 | 21443/sec
In short:
- School Timetabling in Python is about 75% slower than Java
- Employee Scheduling in Python is about 70% slower than Java
- Vehicle Routing in Python is about 95% slower than Java
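These percentages can be checked against the table above (a quick sketch; the summary figures are rounded):

```python
# Move evaluation speeds from the table above, as (Java, Python) pairs per quickstart.
speeds = {
    "School Timetabling": [(230143, 64034), (174881, 42249)],
    "Employee Scheduling": [(85926, 27462), (53265, 15753)],
    "Vehicle Routing": [(330492, 18912), (337094, 21443)],
}

for quickstart, pairs in speeds.items():
    # Slowdown of Python relative to Java, averaged over both datasets.
    slowdowns = [(1 - python / java) * 100 for java, python in pairs]
    average = sum(slowdowns) / len(slowdowns)
    print(f"{quickstart}: Python is about {average:.0f}% slower than Java")
```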
Where is performance lost?
If the Python bytecode is translated to Java bytecode, why is it still significantly slower than Java? There are several differences between the translated bytecode and normal Java bytecode:

- The translated bytecode must be consistent with Python’s semantics. This means:
  - It cannot use a 32-bit int or a 64-bit long to represent a Python int, since Python ints are arbitrary precision (like Java’s BigInteger).
  - Although some dynamic dispatches during operations such as addition can be determined statically, others cannot, due to missing or lost type information. In Java, by contrast, the method signature to invoke is resolved at compile time.
  - When a function expects a Java return type or argument, a conversion to the appropriate Java/translated-Python type must occur.
- The translated bytecode is far from idiomatic Java bytecode. As such, the Java JIT compiler, designed for idiomatic Java bytecode, might have trouble optimizing it. Even Fernflower (a Java decompiler) has trouble decompiling the translated bytecode!
- Inlining opportunities are limited, due to the generated bytecode’s size and usage of interfaces.
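The int constraint above is easy to illustrate: Python ints are arbitrary precision, so a fixed-width Java long cannot represent them in general. A sketch (the as_java_long helper is hypothetical, simulating 64-bit two's-complement wraparound for comparison):

```python
# Python ints grow beyond 64 bits without overflow.
big = 2 ** 100
print(big)  # 1267650600228229401496703205376

# A Java long is 64-bit two's complement; this hypothetical helper simulates
# what storing an oversized Python int in one would do.
def as_java_long(n):
    n &= (1 << 64) - 1  # truncate to 64 bits
    return n - (1 << 64) if n >= (1 << 63) else n

print(as_java_long(2 ** 63))  # -9223372036854775808 (wraps to Long.MIN_VALUE)
```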
What does the result mean in practice?
- For school timetabling and employee scheduling, to get the same result as Java in Python, you should run it about four times as long (so if Java runs for 1 minute, Python should run for 4 minutes).
- For vehicle routing, to get the same result as Java in Python, you should run it about twenty times as long (so if Java runs for 1 minute, Python should run for 20 minutes).

In many cases, this performance loss is acceptable, as it allows you to write your code in pure Python instead of Java, meaning:

- No requirement to learn Java.
- No requirement to maintain a separate Java model that matches your Python model.
- You can use Python libraries like Pydantic, FastAPI and Pandas to create the planning problem and to process solutions.
Note: Python libraries might use functions defined in C, which naturally have no corresponding Python bytecode and thus cannot be translated to Java bytecode. Using C-defined functions in constraints forces a fallback to Python, and the Java objects need to be converted back to Python objects. This causes a massive performance degradation, with roughly the same performance as proxies. C-defined functions can be used outside of constraints without issue.
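One way to check whether a callable carries CPython bytecode at all (a heuristic sketch, not Timefold's actual detection logic) is to look for its __code__ attribute, which C-defined callables lack:

```python
def has_python_bytecode(fn):
    """Heuristic: C-defined callables carry no CPython bytecode to translate."""
    return hasattr(fn, '__code__')

def pure_python(x):
    return x + 1

print(has_python_bytecode(pure_python))  # True  -- has bytecode
print(has_python_bytecode(len))          # False -- C-defined builtin
print(has_python_bytecode(sum))          # False -- C-defined builtin
```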
Conclusion
To realize Timefold’s goal of making PlanningAI accessible to everyone, everyone should be able to write in the language of their choice. For many, that language is Python.
To make writing PlanningAI models in Python practical, the Python solver must match a reasonable fraction of the Java solver’s performance. This is a challenge, as Python puts a myriad of obstacles in the way of performance. But like any obstacles, they can be overcome.
By overcoming these obstacles, we managed to make the Python solver match 25% of the Java solver’s performance, making writing PlanningAI models in Python practical. That being said, for use cases that truly require the most performant code, it is probably best to stick to Java.