Performance is a top priority for Timefold Solver, since any performance gain means we can find the same solution in less time (or, alternatively, a better solution in the same amount of time). This is also true for Python, which is typically one or two orders of magnitude slower than Java. Through a variety of techniques, we are able to make Python reach 25% of Java’s throughput. Read on to find out the methods we deploy to improve Python’s performance, and see how Timefold Solver for Python compares to Java on a variety of PlanningAI models.
The Global Interpreter Lock, or GIL for short, is a lock that prevents multiple threads from executing Python code at the same time. This is because Python’s core data structures, such as dict, list and object are not thread-safe.
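To illustrate (a minimal sketch; timings are omitted since they vary by machine): even with two threads, CPU-bound code like this executes its bytecode one thread at a time under the GIL, so threading yields correct results but no parallel speedup.

```python
import threading

# A CPU-bound task. Because of the GIL, running two of these in
# separate threads does not execute Python bytecode in parallel,
# so threading gives no speedup for this kind of work.
def count(n):
    total = 0
    for i in range(n):
        total += i
    return total

results = []
threads = [
    threading.Thread(target=lambda: results.append(count(1_000_000)))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [499999500000, 499999500000]
```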
Python is dynamically typed, which means the result type of an operation cannot be determined before its execution. Moreover, many operations, such as addition and attribute lookup, are also dynamic, since they look up attributes on their operands' types. As a result, this seemingly simple Python code:
```python
def add(x, y):
    return x + y
```
does something a lot more complex underneath:
```python
def ignore_type_error_if_builtin_type(arg_type, method, x, y):
    try:
        return method(x, y)
    except TypeError:
        if arg_type in [int, str, list, ...]:
            return NotImplemented
        raise

def add(x, y):
    x_type = type(x)
    x_add = x_type.__dict__.get('__add__')
    y_type = type(y)
    y_radd = y_type.__dict__.get('__radd__')
    if x_add is None and y_radd is None:
        raise TypeError('...')
    elif y_radd is None:
        out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
        if out is NotImplemented:
            raise TypeError('...')
        return out
    elif x_add is None:
        out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
        if out is NotImplemented:
            raise TypeError('...')
        return out
    else:
        if x_type != y_type and issubclass(y_type, x_type):
            out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
            if out is NotImplemented:
                out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
                if out is NotImplemented:
                    raise TypeError('...')
        else:
            out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
            if out is NotImplemented:
                out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
                if out is NotImplemented:
                    raise TypeError('...')
        return out
```
For custom types, this complexity causes addition to have about double the overhead of a direct method call:
```python
import timeit

source = '''
class A:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return self.value + other.value

a = A(0)
b = A(1)
'''

timeit.timeit('a + b', source)
# 0.04903000299964333
timeit.timeit('add(a, b)', f'{source}\nadd=A.__add__')
# 0.029219600999567774
```
Python supports monkey-patching; that is, the behavior of types and functions can be modified at runtime. For example, you can change the implementation of a type’s method:
```python
class Dog:
    def talk(self):
        return 'Woof'

a = Dog()
print(a.talk())  # Prints 'Woof'

Dog.talk = lambda self: 'Meow'
print(a.talk())  # Prints 'Meow'
```
Monkey-patching prevents method inlining, an optimization technique where a method is copied into its call sites, removing call overhead and allowing further optimization to be done.
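To see why (a small illustrative sketch, not Timefold's actual code): if a compiler copied the body of `Dog.talk` into its caller, the inlined copy would silently go stale as soon as the method is patched.

```python
# Sketch of why monkey-patching blocks inlining: a compiler could
# replace dog.talk() with its body ('Woof'), but that becomes wrong
# the moment Dog.talk is patched at runtime.
class Dog:
    def talk(self):
        return 'Woof'

def greet(dog):
    return dog.talk()   # dynamic lookup on every call

def greet_inlined(dog):
    return 'Woof'       # hand-"inlined"; only valid if talk is never patched

d = Dog()
assert greet(d) == greet_inlined(d) == 'Woof'

Dog.talk = lambda self: 'Meow'
print(greet(d))          # 'Meow' -- follows the patch
print(greet_inlined(d))  # 'Woof' -- stale: the inlined copy is now wrong
```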
Python currently lacks a JIT compiler, meaning it cannot optimize code based on runtime characteristics (such as how often a branch is taken). This will probably change in the future, as an experimental JIT compiler for Python is currently being worked on.
However, all of the above pales in comparison to what was our biggest performance obstacle…
A proxy is a class that acts as a facade for arbitrary classes. In order for Java to call a Python method (or vice versa), a Foreign Function Interface must be used. A Foreign Function Interface, or FFI for short, is an interface that allows one programming language to call functions of another. In Java, there are currently two different FFI implementations: the Java Native Interface (JNI) and the Foreign Function & Memory API (java.lang.foreign).
JNI and the foreign API both return a raw pointer (in JNI, a long, and in foreign, a MemorySegment) for Python objects. However, a raw pointer has little use in Java code, which expects interfaces like Comparable or Function. Typically, proxies are used to convert the pointer into Comparable, Function, or some other interface.
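Conceptually, a proxy looks something like this hypothetical Python sketch (illustrative only; `ComparableProxy` and `fake_ffi` are invented names, not JPype's real implementation):

```python
# Hypothetical sketch of a proxy: it wraps an opaque handle and turns
# each interface call into argument conversion + an FFI hop + result
# conversion. (Illustrative only; not how JPype actually does it.)
class ComparableProxy:
    def __init__(self, handle, ffi_call):
        self._handle = handle
        self._ffi_call = ffi_call  # stands in for the real FFI boundary

    def compare_to(self, other):
        # convert arguments, cross the boundary, convert the result back
        return int(self._ffi_call('compareTo', self._handle, other._handle))

# A fake FFI for illustration: it resolves handles to values and compares them.
fake_objects = {1: 10, 2: 20}

def fake_ffi(method, a, b):
    return (fake_objects[a] > fake_objects[b]) - (fake_objects[a] < fake_objects[b])

x = ComparableProxy(1, fake_ffi)
y = ComparableProxy(2, fake_ffi)
print(x.compare_to(y))  # -1
```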
If a proxy is empty, it is only about 1.63× slower than a direct call. However, proxies are often not empty: they frequently need to do complex tasks, such as converting the arguments and results of methods. How slow are we talking? Consider this simple benchmark:
```java
public class Benchmark {
    public static void benchmark(Runnable runner) {
        var start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            runner.run();
        }
        var end = System.nanoTime();
        System.out.printf("%.9f\n", (end - start) / 1_000_000_000d);
    }

    public static void main(String[] args) {
        System.out.println("Cold");
        benchmark(() -> {});
        System.out.println("Hot");
        benchmark(() -> {});
    }
}
```
Here are the runtimes for running benchmark with an empty Runnable in Java versus an empty Runnable implemented as a Python proxy:
| Method | Runtime (seconds) |
| --- | --- |
| Java Direct Call (Cold) | 0.000232610 |
| Java Direct Call (Hot) | 0.000240525 |
| Java FFI Direct Call (Cold) | 0.005904000 |
| Java FFI Direct Call (Hot) | 0.000442477 |
| JPype Proxy (Cold) | 0.006906666 |
| JPype Proxy (Hot) | 0.004721682 |
| GraalPy Proxy (Cold) | 0.027417475 |
| GraalPy Proxy (Hot) | 0.000883732 |
Despite the Python code literally doing nothing, a JPype proxy is 967.102% slower than a direct FFI call into Python when hot. GraalPy fares better, at only 100% slower. Things get even worse when there are arguments and return types involved (for example, with java.util.function.Predicate). Here, we pass a Python list to Java, which calls a Python proxy function on each item in the list:
```python
import jpype

jpype.startJVM(classpath=['.'])
benchmark = jpype.JClass('Benchmark')

class Item:
    def __init__(self, value):
        self.value = value

@jpype.JImplements('java.util.function.Predicate')
class Runner:
    @jpype.JOverride
    def test(self, item):
        return item.value

def proxy(item):
    return jpype.JProxy(jpype.JClass('java.io.Serializable'), inst=item, convert=True)

arraylist = jpype.JClass('java.util.ArrayList')
items = arraylist([proxy(Item(i % 2 == 0)) for i in range(10000)])

print('Cold')
benchmark.benchmark(Runner(), items)
print('Hot')
benchmark.benchmark(Runner(), items)
```
Or with GraalPy:
```python
import java.Benchmark

class Item:
    def __init__(self, value):
        self.value = value

def test(item):
    return item.value

items = [Item(i % 2 == 0) for i in range(10000)]

print("Cold")
java.Benchmark.benchmark(test, items)
print("Hot")
java.Benchmark.benchmark(test, items)
```
| Method | Runtime (seconds) |
| --- | --- |
| Java Direct Call (Cold) | 0.000819039 |
| Java Direct Call (Hot) | 0.000559228 |
| Java FFI Direct Call (Cold) | 0.004265137 |
| Java FFI Direct Call (Hot) | 0.000807377 |
| JPype Proxy (Cold) | 0.015507570 |
| JPype Proxy (Hot) | 0.012130092 |
| GraalPy Proxy (Cold) | 0.070494332 |
| GraalPy Proxy (Hot) | 0.004472111 |
Here, JPype is 1402.41% slower than calling Python directly via FFI when hot! GraalPy also degraded, becoming 453.906% slower than the direct FFI call.
In particular, when the arguments are passed to Java, they are converted to Java proxies. However, when those arguments are passed back to Python, they are not converted back into Python objects. Instead, a Python proxy is created for the Java proxy, adding overhead to every single operation.
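The round trip can be sketched in plain Python (hypothetical classes, invented purely for illustration):

```python
# Hypothetical sketch of the round trip described above: a Python
# object is wrapped in a (simulated) Java proxy, and when it comes
# back to Python it is wrapped again instead of being unwrapped, so
# every attribute access pays two levels of indirection.
class JavaSideProxy:
    def __init__(self, target):
        self._target = target

class PythonSideProxy:
    def __init__(self, java_proxy):
        self._java_proxy = java_proxy

    def __getattr__(self, name):
        # two hops: Python proxy -> Java proxy -> original object
        return getattr(self._java_proxy._target, name)

class Item:
    def __init__(self, value):
        self.value = value

wrapped = PythonSideProxy(JavaSideProxy(Item(42)))
print(wrapped.value)  # 42, but reached via two wrapper hops
```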
Given the problems with proxies, the obvious solution is to just manage the opaque Python pointers and do direct Python calls. Unfortunately, using FFI directly is hard:
- In order for garbage collection to work, the Java and Python garbage collectors must be aware of each other (especially if Java objects are passed to Python). In particular, when an arena is closed, it is zeroed out (to prevent use-after-free). This means that if Python is still tracking the object (because it retains a reference), a segmentation fault will occur on the next garbage collection. However, if an arena is never closed, we will eventually run out of memory. This is especially true for long-running processes, such as solving. JPype solves this with deep magic that links Java’s and Python’s garbage collectors. Unfortunately, calling Python functions directly using FFI is unsupported in JPype.
- Java FFI is designed around starting from Java, not from Python (i.e. Java as the host). As a result, without deep magic like JPype’s, a separate Java process must be launched to use FFI with an independent Python interpreter.
- Java objects cannot be used as Python objects directly, meaning some proxies still need to be generated when a Java object is passed to Python code.
# Avoiding all the problems (and replacing them with new ones)
So how can we avoid the performance losses from:
- the Global Interpreter Lock
- dynamically typed code
- monkey-patching support
- no JIT compiler
- proxy overhead
with a single solution? The (not so simple) answer is to translate Python’s bytecode and objects to Java bytecode and objects.
- No Global Interpreter Lock, as Java doesn’t have one.
- Can treat type annotations as strict, meaning we can determine which method to call at compile time in some cases.
- Can remove monkey-patching support; by the time the code is compiled, typically all monkey patching (if any) is already done.
- Java has a JIT compiler.
- Because the Python objects are translated to Java objects, no proxies are involved.
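For instance (a sketch of the idea, not Timefold's actual translator), strict annotations would let a translator resolve the addition below to integer addition at compile time, instead of the dynamic `__add__`/`__radd__` protocol shown earlier:

```python
# With annotations treated as strict, a translator can see that
# x + y here is always int addition and dispatch it statically.
def add(x: int, y: int) -> int:
    return x + y

# The annotations are available for a translator to inspect:
print(add.__annotations__)
print(add(2, 3))  # 5
```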
However, there is a not-so-simple part:
- Unlike Java, Python’s bytecode changes drastically from version to version. Even the behavior of existing opcodes may change (i.e. Python bytecode is neither forwards nor backwards compatible).
- Python’s bytecode has several quirks that make it non-trivial to map directly to Java bytecode. For instance, Python keeps a for loop’s iterator on the stack for the entire loop body. As a result, when an exception is caught inside a for loop, Python bytecode expects there to still be an iterator on the stack (whereas in Java, the exception handler’s stack is always empty).
- The Java object result needs to be converted back to a Python object.
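The for-loop quirk mentioned above is visible in CPython's own bytecode; the dis module shows the FOR_ITER instruction that drives the loop from the evaluation stack:

```python
import dis

# The for loop compiles to a FOR_ITER instruction, which reads the
# iterator kept on the evaluation stack for the entire loop body.
def loop(items):
    for item in items:
        print(item)

print(any(instr.opname == 'FOR_ITER' for instr in dis.Bytecode(loop)))  # True
```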
Annoying as these problems are, they are significantly easier to debug and test than using FFI directly; we can rely on JPype’s deep magic to keep Java’s and Python’s garbage collectors in sync, and we can have deterministic reproducers and test cases.
Is it truly worth all the hassle of translating Python’s bytecode? The only way to know is to measure the speed before and after bytecode translation. Bytecode translation was originally implemented in OptaPy (of which Timefold Solver for Python is a fork), which included an option to use proxies instead. As the bytecode translation matured, the option to use proxies was removed from Timefold Solver for Python.
Thus, to compare the performance of proxies versus translated bytecode, we will use the latest (and probably last) release of OptaPy (9.37.0b0) on Python 3.10.
Here are the score calculation speeds of using proxies versus translated bytecode in OptaPy’s school timetabling quickstart (higher is better).
Note: OptaPy’s School Timetabling quickstart is not equivalent to Timefold’s School Timetabling quickstart: there are bugs in the constraints that are fixed in the Timefold versions but were never addressed in OptaPy’s version.
| Method | Score Calculation Speed |
| --- | --- |
| Proxies | 599/sec |
| Translated Bytecode | 69558/sec |
In short, using translated bytecode multiplied our throughput by more than 100 (an over 11,000% performance increase)!
Now finally to the topic of the blog post: how fast is Timefold Solver for Python on various PlanningAI models? Below are the move evaluation speeds of the Java and Python Timefold 1.15.0 quickstarts:
| Quickstart | Dataset | Language | Move Evaluation Speed |
| --- | --- | --- | --- |
| School Timetabling | SMALL | Java 22.0.1 | 230143/sec |
| School Timetabling | SMALL | Python 3.12 | 64034/sec |
| School Timetabling | LARGE | Java 22.0.1 | 174881/sec |
| School Timetabling | LARGE | Python 3.12 | 42249/sec |
| Employee Scheduling | SMALL | Java 22.0.1 | 85926/sec |
| Employee Scheduling | SMALL | Python 3.12 | 27462/sec |
| Employee Scheduling | LARGE | Java 22.0.1 | 53265/sec |
| Employee Scheduling | LARGE | Python 3.12 | 15753/sec |
| Vehicle Routing | PHILADELPHIA | Java 22.0.1 | 330492/sec |
| Vehicle Routing | PHILADELPHIA | Python 3.12 | 18912/sec |
| Vehicle Routing | HARTFORT | Java 22.0.1 | 337094/sec |
| Vehicle Routing | HARTFORT | Python 3.12 | 21443/sec |
In short:
- School Timetabling in Python is about 75% slower than Java
- Employee Scheduling in Python is about 70% slower than Java
- Vehicle Routing in Python is about 95% slower than Java
If the Python bytecode is translated to Java bytecode, why is it still significantly slower than Java? There are several differences between the translated bytecode and normal Java bytecode:
- The translated bytecode must be consistent with Python’s semantics. This means:
  - We cannot use a 32-bit int or a 64-bit long to represent a Python int, since Python ints are arbitrary-precision (like Java’s BigInteger).
  - Although some dynamic dispatches (such as those performed during addition) can be resolved statically, others cannot, due to missing or lost type information. In Java, by contrast, the method signature to invoke is always determined at compile time.
  - When a function expects a Java argument or return type, a conversion to the appropriate Java or translated-Python type must occur.
- The translated bytecode is far from idiomatic Java bytecode. As such, the Java JIT compiler, designed for idiomatic Java bytecode, might have trouble optimizing it. Even Fernflower (a Java decompiler) has trouble decompiling the translated bytecode!
- Inlining opportunities are limited, due to the generated bytecode’s size and its usage of interfaces.
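On the first point, a Python int can exceed the range of any fixed-width Java primitive:

```python
# Python ints have arbitrary precision, so a 32-bit int or 64-bit
# long cannot represent them in general:
big = 2 ** 100
print(big)                # 1267650600228229401496703205376
print(big > 2 ** 63 - 1)  # True: does not fit in a Java long
```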
For School Timetabling and Employee Scheduling, to get the same result in Python as in Java, you should run the solver about four times as long (so if Java runs for 1 minute, Python should run for 4 minutes).
For Vehicle Routing, to get the same result in Python as in Java, you should run the solver about twenty times as long (so if Java runs for 1 minute, Python should run for 20 minutes).
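These factors come straight from the ratios of the move evaluation speeds in the table above:

```python
# Where the "four times" and "twenty times" guidance comes from:
# ratios of the move evaluation speeds (moves/sec) measured above.
tt_java, tt_python = 230143, 64034     # School Timetabling, SMALL
vrp_java, vrp_python = 330492, 18912   # Vehicle Routing, PHILADELPHIA

print(tt_java / tt_python)    # about 3.6 -> run Python ~4x longer
print(vrp_java / vrp_python)  # about 17.5 -> run Python ~20x longer
```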
In many cases, this performance loss is acceptable, as it allows you to write your code in pure Python instead of Java, meaning:
- No requirement to learn Java.
- No requirement to make a separate Java model that matches your Python model.
- Can use Python libraries like Pydantic, FastAPI and Pandas to create the planning problem and to process solutions.
Note: Python libraries might use functions defined in C, which naturally have no corresponding Python bytecode and thus cannot be translated to Java bytecode.
- Using C-defined functions in constraints forces a fallback to Python, and the Java objects must be converted back to Python objects. This causes a massive performance degradation, leaving performance roughly on par with proxies.
- C-defined functions can be used outside of constraints without issue.
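You can check whether a function is defined in C: C-defined functions carry no `__code__` object, and hence no Python bytecode to translate.

```python
import types

# Built-in functions like len are implemented in C and have no
# Python bytecode:
print(isinstance(len, types.BuiltinFunctionType))  # True
print(hasattr(len, '__code__'))                    # False

# A pure-Python function carries its bytecode in __code__:
def pure_python(x):
    return x

print(hasattr(pure_python, '__code__'))            # True
```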
To make Timefold’s goal of making PlanningAI accessible to everyone a reality, everyone should be able to write in the language of their choice. For many, that language is Python.
To make writing PlanningAI models in Python practical, the Python solver must match a reasonable fraction of the Java solver’s performance. This is a challenge, as Python puts a myriad of obstacles in the way of performance. But like any obstacle, these can be overcome.
By overcoming these obstacles, we managed to make the Python solver match 25% of the Java solver’s performance, making writing PlanningAI models in Python practical. That being said, for use cases that truly require the most performant code, it is probably best to stick to Java.