Published in Blog
Java versus Python performance benchmarks on PlanningAI models
Discover the techniques Timefold Engineers deploy to make your Python code faster.
Performance is a top priority for Timefold Solver, since any performance gain means we can find the same solution in less time (or, alternatively, a better solution in the same amount of time). This is also true for Python, which is typically one to two orders of magnitude slower than Java. Through a variety of techniques, we are able to make Python reach 25% of Java’s throughput. Read on to find out the methods we deploy to improve Python’s performance, and see how Timefold Solver for Python compares to Java on a variety of PlanningAI models.
Note: Usage of Python in this article refers exclusively to the CPython implementation.
Why is Python relatively slow?
Python’s performance challenges are mostly due to design decisions:
Global Interpreter Lock
The Global Interpreter Lock, or GIL for short, is a lock that prevents multiple threads from executing Python code at the same time.
This is because Python’s core data structures, such as dict, list, and object, are not thread-safe.
When the GIL was introduced (Python 1.5, in the year 1998), multi-threaded programs were rare (most computers only supported one thread at a time), so using a GIL simplified the experience for Python extension developers without degrading single-threaded performance. The GIL was a good idea for its time, but it gradually became a liability as multi-threaded programs and computers became more common. Currently, Python is working on removing the GIL, but it will be several releases before that work is widely available.
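The effect is easy to observe with a small sketch (the counting workload below is hypothetical, and exact timings vary by machine): on a GIL build, running CPU-bound Python code on two threads takes about as long as running it twice sequentially, because only one thread executes Python bytecode at a time.

```python
# A hypothetical CPU-bound workload (illustration only; timings vary by machine).
import threading
import time

def count(n):
    total = 0
    for i in range(n):
        total += i
    return total

N = 2_000_000
results = [None, None]

def worker(index):
    results[index] = count(N)

# Run the workload on two threads...
start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# ...and twice sequentially, for comparison.
start = time.perf_counter()
sequential_results = [count(N), count(N)]
sequential = time.perf_counter() - start

# On a GIL build, the threaded run takes roughly as long as the sequential one,
# because only one thread executes Python bytecode at a time.
print(f"threaded:   {threaded:.3f}s")
print(f"sequential: {sequential:.3f}s")
```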
Dynamically Typed
Python is dynamically typed, which means the result type of an operation cannot be determined before its execution. Moreover, many operations such as addition and attribute lookup are themselves dynamic, since they look up attributes on their operands' types. As a result, this seemingly simple Python code:
def add(x, y):
    return x + y
does something far more complex underneath:
def ignore_type_error_if_builtin_type(arg_type, method, x, y):
    try:
        return method(x, y)
    except TypeError:
        if arg_type in [int, str, list, ...]:
            return NotImplemented
        raise

def add(x, y):
    x_type = type(x)
    x_add = x_type.__dict__.get('__add__')
    y_type = type(y)
    y_radd = y_type.__dict__.get('__radd__')
    if x_add is None and y_radd is None:
        raise TypeError('...')
    elif y_radd is None:
        out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
        if out is NotImplemented:
            raise TypeError('...')
        return out
    elif x_add is None:
        out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
        if out is NotImplemented:
            raise TypeError('...')
        return out
    else:
        if x_type != y_type and issubclass(y_type, x_type):
            out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
            if out is NotImplemented:
                out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
                if out is NotImplemented:
                    raise TypeError('...')
        else:
            out = ignore_type_error_if_builtin_type(x_type, x_add, x, y)
            if out is NotImplemented:
                out = ignore_type_error_if_builtin_type(y_type, y_radd, y, x)
                if out is NotImplemented:
                    raise TypeError('...')
        return out
For custom types, this complexity causes addition to have about double the overhead of a direct method call:
import timeit

source = \
'''
class A:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return self.value + other.value

a = A(0)
b = A(1)
'''

timeit.timeit('a + b', source)
# 0.04903000299964333
timeit.timeit('add(a, b)', f'{source}\nadd=A.__add__')
# 0.029219600999567774
However, Python’s specializing adaptive interpreter speeds up the code for builtin types, such as int:
import timeit
timeit.timeit('a + b', 'a=0\nb=1')
# 0.009246169998732512
timeit.timeit('add(a, b)', 'a=0\nb=1\nadd=int.__add__')
# 0.04777865600044606
Monkey-Patching
Python supports monkey-patching; that is, the behavior of types and functions can be modified at runtime. For example, you can change the implementation of a type’s method:
class Dog:
    def talk(self):
        return 'Woof'

a = Dog()
print(a.talk())  # Prints 'Woof'

Dog.talk = lambda self: 'Meow'
print(a.talk())  # Prints 'Meow'
Monkey-patching prevents method inlining, an optimization technique where a method is copied into its call sites, removing call overhead and allowing further optimization to be done.
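Why inlining is unsafe here can be sketched in pure Python: caching a method lookup (a stand-in for what an inlining compiler would bake into a call site) silently misses a later monkey-patch.

```python
class Dog:
    def talk(self):
        return 'Woof'

a = Dog()
cached_talk = a.talk            # resolve the method once, as inlining would

Dog.talk = lambda self: 'Meow'  # monkey-patch after the call site was "inlined"

print(a.talk())       # 'Meow' -- a fresh dynamic lookup sees the patch
print(cached_talk())  # 'Woof' -- the cached ("inlined") call is now stale
```

An inlining compiler that froze the original lookup would return the stale answer, which is why CPython must re-resolve the method on every call.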
Lack of a JIT compiler
Python currently lacks a JIT compiler, meaning it cannot optimize code based on runtime characteristics (such as how often a branch is taken). This will probably change in the future, as an experimental JIT compiler for Python is currently being worked on.
However, all of the above is insignificant compared to our biggest performance obstacle…
Proxies are slow
A proxy is a class that acts as a facade for arbitrary classes. In order for Java to call a Python method (or vice versa), a Foreign Function Interface must be used. A Foreign Function Interface, or FFI in short, is an interface that allows one program language to call functions of another. In Java, there are currently two different FFI implementations:
- Java Native Interface (JNI)
- (Java 22 and above) Foreign Function and Memory API (foreign)
JNI and the foreign API both return a raw pointer for Python objects (in JNI, a long, and in foreign, a MemorySegment). However, a pointer has little use in Java code, which expects interfaces like Comparable or Function. Typically, proxies are used to convert the pointer into Comparable, Function, or some other interface.
Why Proxies are slow
If a proxy is empty, it is only about 1.63 times slower than a direct call. However, proxies are often not empty: they frequently need to perform complex tasks such as converting the arguments and results of methods.
How slow are we talking? Consider this simplistic benchmark:
public class Benchmark {
    public static void benchmark(Runnable runner) {
        var start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            runner.run();
        }
        var end = System.nanoTime();
        System.out.printf("%.9f\n", (end - start) / 1_000_000_000d);
    }

    public static void main(String[] args) {
        System.out.println("Cold");
        benchmark(() -> {});
        System.out.println("Hot");
        benchmark(() -> {});
    }
}
We can invoke it via JPype:
import jpype

jpype.startJVM(classpath=['.'])
benchmark = jpype.JClass('Benchmark')

@jpype.JImplements('java.lang.Runnable')
class Runner:
    @jpype.JOverride
    def run(self):
        pass

print("Cold")
benchmark.benchmark(Runner())
print("Hot")
benchmark.benchmark(Runner())
or GraalPy:
import java.Benchmark

def run():
    pass

print("Cold")
java.Benchmark.benchmark(run)
print("Hot")
java.Benchmark.benchmark(run)
Here are the runtimes for running benchmark with an empty Runnable in Java versus an empty Runnable implemented as a Python proxy:
Method | Runtime (seconds)
---|---
Java Direct Call (Cold) | 0.000232610
Java Direct Call (Hot) | 0.000240525
Java FFI Direct Call (Cold) | 0.005904000
Java FFI Direct Call (Hot) | 0.000442477
JPype Proxy (Cold) | 0.006906666
JPype Proxy (Hot) | 0.004721682
GraalPy Proxy (Cold) | 0.027417475
GraalPy Proxy (Hot) | 0.000883732
Despite the Python code literally doing nothing, a JPype proxy is 967.102% slower than a direct FFI call when hot. GraalPy fares better, at only about 100% slower.
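Working the hot rows of the table above (a quick check, taking the Java FFI direct call as the baseline for a direct call into Python):

```python
# Hot runtimes from the table above, in seconds.
ffi_direct = 0.000442477     # Java FFI Direct Call (Hot)
jpype_proxy = 0.004721682    # JPype Proxy (Hot)
graalpy_proxy = 0.000883732  # GraalPy Proxy (Hot)

def percent_slower(measured, baseline):
    return (measured / baseline - 1) * 100

print(f"JPype proxy:   {percent_slower(jpype_proxy, ffi_direct):.3f}% slower")    # 967.102
print(f"GraalPy proxy: {percent_slower(graalpy_proxy, ffi_direct):.3f}% slower")  # 99.724
```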
Things get even worse when there are arguments and return types involved (for example, with java.util.function.Predicate). Here, we pass a Python list to Java, which calls a Python proxy function on each item in the list.
import jpype

jpype.startJVM(classpath=['.'])
benchmark = jpype.JClass('Benchmark')

class Item:
    def __init__(self, value):
        self.value = value

@jpype.JImplements('java.util.function.Predicate')
class Runner:
    @jpype.JOverride
    def test(self, item):
        return item.value

def proxy(item):
    return jpype.JProxy(jpype.JClass('java.io.Serializable'), inst=item, convert=True)

arraylist = jpype.JClass('java.util.ArrayList')
items = arraylist([proxy(Item(i % 2 == 0)) for i in range(10000)])

print('Cold')
benchmark.benchmark(Runner(), items)
print('Hot')
benchmark.benchmark(Runner(), items)
or GraalPy:
import java.Benchmark

class Item:
    def __init__(self, value):
        self.value = value

def test(item):
    return item.value

items = [Item(i % 2 == 0) for i in range(10000)]

print("Cold")
java.Benchmark.benchmark(test, items)
print("Hot")
java.Benchmark.benchmark(test, items)
Method | Runtime (seconds)
---|---
Java Direct Call (Cold) | 0.000819039
Java Direct Call (Hot) | 0.000559228
Java FFI Direct Call (Cold) | 0.004265137
Java FFI Direct Call (Hot) | 0.000807377
JPype Proxy (Cold) | 0.015507570
JPype Proxy (Hot) | 0.012130092
GraalPy Proxy (Cold) | 0.070494332
GraalPy Proxy (Hot) | 0.004472111
Here, the JPype proxy is 1402.41% slower than a direct FFI call when hot! GraalPy also degrades, becoming 453.906% slower than a direct FFI call.
In particular, when the arguments are passed to Java, they are converted to Java proxies. However, when the arguments are passed back to Python, they are not converted back to Python objects. Instead, a Python proxy is created for the Java proxy, creating additional overhead for every single operation.
Why using FFI directly is hard
Given the problems with proxies, the obvious solution is to just manage the opaque Python pointers and do direct Python calls. Unfortunately, using FFI directly is hard:
- In order for garbage collection to work, the Java and Python garbage collectors must be aware of each other (especially if Java objects are passed to Python). In particular, if an arena is closed, it is zeroed out (to prevent use-after-free). This means that if Python is still tracking the object (because it retains a reference), a segmentation fault will occur on the next garbage collection. However, if an arena is never closed, we will eventually run out of memory. This is especially true for long-running processes, such as solving. JPype solves this with deep magic that links Java’s and Python’s garbage collectors. Unfortunately, calling Python functions directly using FFI is unsupported in JPype.
- Java FFI is designed around starting from Java, not from Python (i.e. Java host). As a result, without deep magic like JPype’s, a separate Java process must be launched to use FFI with an independent Python interpreter.
- Java objects cannot be used as Python objects directly, meaning some proxies will need to be generated if a Java object is passed to Python code.
Avoiding all the problems (and replacing them with new ones)
So how can we avoid the performance losses from:

- the Global Interpreter Lock
- dynamically typed code
- monkey-patching support
- the lack of a JIT compiler
- proxy overhead

with a single solution? The (not so simple) answer is to translate Python’s bytecode and objects to Java bytecode and objects.
- There is no Global Interpreter Lock, as Java doesn’t have one.
- We can treat type annotations as strict, meaning we can determine which method to call at compile time in some cases.
- We can remove monkey-patching support; by the time the code is compiled, typically all monkey patching (if any) is already done.
- Java has a JIT compiler.
- Because the Python objects are translated to Java objects, no proxies are involved.
However, there is a not-so-simple part:

- Unlike Java, Python’s bytecode changes drastically from version to version. Even the behavior of existing opcodes may change (i.e. Python bytecode is neither forwards nor backwards compatible).
- Python’s bytecode has several quirks that make it non-trivial to map directly to Java bytecode. For instance, Python keeps its for loop’s iterator on the stack for the entire loop body. As a result, when an exception is caught in a for loop, Python bytecode expects there to still be an iterator on the stack (whereas in Java, the exception handler’s stack is always empty).
- The Java object result needs to be converted back to a Python object.
Annoying as these problems are, they are significantly easier to debug and test than using FFI directly; we can rely on JPype’s deep magic to keep Java’s and Python’s garbage collectors in sync, and we can have deterministic reproducers and test cases.
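The for-loop quirk can be observed with the standard dis module (a sketch; exact opcode names vary by CPython version, per the compatibility caveat above). CPython emits GET_ITER to push the iterator and FOR_ITER to advance it, and that iterator stays on the evaluation stack for the whole loop body:

```python
import dis

def loop(items):
    total = 0
    for x in items:
        total += x
    return total

# Collect the opcode names of the compiled function.
names = [instruction.opname for instruction in dis.Bytecode(loop)]
print(names)

# GET_ITER pushes the iterator; FOR_ITER advances it on each iteration
# while the iterator remains on the stack, unlike anything idiomatic
# Java bytecode would produce.
```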
How does the translated bytecode perform?
Is it truly worth all the hassle of translating Python’s bytecode? The only way to know is to measure the speed before and after the bytecode translation. Bytecode translation was originally implemented in OptaPy (of which Timefold Solver for Python is a fork), which included an option to use proxies instead. As the bytecode generation matured, the option to use proxies was removed from Timefold Solver for Python.
Thus, to compare the performance of proxies versus translated bytecode, we will use the latest (and probably last) release of OptaPy (9.37.0b0) on Python 3.10.
Here are the score calculation speeds of using proxies versus translated bytecode in OptaPy’s school timetabling quickstart (higher is better).
Note: OptaPy’s School Timetabling quickstart is not equivalent to Timefold’s School Timetabling quickstart: there are bugs in the constraints that were fixed in the Timefold version but never addressed in OptaPy’s version.
Method | Score Calculation Speed
---|---
Proxies | 599/sec
Translated Bytecode | 69558/sec
In short, using translated bytecode multiplied our throughput by more than 100 (an over 10,000% performance increase)!
So… how fast is Timefold Solver for Python?
Now finally to the topic of the blog post: how fast is Timefold Solver for Python on various PlanningAI models? Below are the move evaluation speeds of the Java and Python Timefold 1.15.0 quickstarts:
Quickstart | Dataset | Language | Move Evaluation Speed
---|---|---|---
School Timetabling | SMALL | Java 22.0.1 | 230143/sec
School Timetabling | SMALL | Python 3.12 | 64034/sec
School Timetabling | LARGE | Java 22.0.1 | 174881/sec
School Timetabling | LARGE | Python 3.12 | 42249/sec
Employee Scheduling | SMALL | Java 22.0.1 | 85926/sec
Employee Scheduling | SMALL | Python 3.12 | 27462/sec
Employee Scheduling | LARGE | Java 22.0.1 | 53265/sec
Employee Scheduling | LARGE | Python 3.12 | 15753/sec
Vehicle Routing | PHILADELPHIA | Java 22.0.1 | 330492/sec
Vehicle Routing | PHILADELPHIA | Python 3.12 | 18912/sec
Vehicle Routing | HARTFORT | Java 22.0.1 | 337094/sec
Vehicle Routing | HARTFORT | Python 3.12 | 21443/sec
In short:
- School Timetabling in Python is about 75% slower than Java
- Employee Scheduling in Python is about 70% slower than Java
- Vehicle Routing in Python is about 95% slower than Java
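These percentages can be checked against the table above (a quick sketch; the summary figures are rounded):

```python
# Move evaluation speeds from the table above, as (Java, Python) pairs per quickstart.
speeds = {
    "School Timetabling": [(230143, 64034), (174881, 42249)],
    "Employee Scheduling": [(85926, 27462), (53265, 15753)],
    "Vehicle Routing": [(330492, 18912), (337094, 21443)],
}

for quickstart, pairs in speeds.items():
    # Slowdown of Python relative to Java, averaged over both datasets.
    slowdowns = [(1 - python / java) * 100 for java, python in pairs]
    average = sum(slowdowns) / len(slowdowns)
    print(f"{quickstart}: Python is about {average:.0f}% slower than Java")
```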
Where is performance lost?
If the Python bytecode is translated to Java bytecode, why is it still significantly slower than Java? There are several differences between the translated bytecode and normal Java bytecode:

- The translated bytecode must be consistent with Python’s semantics. This means:
  - It cannot use a 32-bit int or a 64-bit long to represent a Python int, since Python ints are arbitrary precision (like Java’s BigInteger).
  - Although some dynamic dispatches during operations such as addition can be determined statically, others cannot, due to missing or lost type information. In Java, by contrast, the method signature to invoke is resolved at compile time.
  - When a function expects a Java return type or argument, a conversion to the appropriate Java/translated-Python type must occur.
- The translated bytecode is far from idiomatic Java bytecode. As such, the Java JIT compiler, designed for idiomatic Java bytecode, might have trouble optimizing it. Even Fernflower (a Java decompiler) has trouble decompiling the translated bytecode!
- Inlining opportunities are limited, due to the generated bytecode’s size and usage of interfaces.
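The int constraint above is easy to illustrate: Python ints are arbitrary precision, so a fixed-width Java long cannot represent them in general. A sketch (the as_java_long helper is hypothetical, simulating 64-bit two's-complement wraparound for comparison):

```python
# Python ints grow beyond 64 bits without overflow.
big = 2 ** 100
print(big)  # 1267650600228229401496703205376

# A Java long is 64-bit two's complement; this hypothetical helper simulates
# what storing an oversized Python int in one would do.
def as_java_long(n):
    n &= (1 << 64) - 1  # truncate to 64 bits
    return n - (1 << 64) if n >= (1 << 63) else n

print(as_java_long(2 ** 63))  # -9223372036854775808 (wraps to Long.MIN_VALUE)
```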
What does the result mean in practice?
- For school timetabling and employee scheduling, to get the same result as Java in Python, you should run it about four times as long (so if Java runs for 1 minute, Python should run for 4 minutes).
- For vehicle routing, to get the same result as Java in Python, you should run it about twenty times as long (so if Java runs for 1 minute, Python should run for 20 minutes).

In many cases, this performance loss is acceptable, as it allows you to write your code in pure Python instead of Java, meaning:

- No requirement to learn Java.
- No requirement to maintain a separate Java model that matches your Python model.
- You can use Python libraries like Pydantic, FastAPI and Pandas to create the planning problem and to process solutions.
Note: Python libraries might use functions defined in C, which naturally have no corresponding Python bytecode and thus cannot be translated to Java bytecode. Using C-defined functions in constraints forces a fallback to Python, and the Java objects need to be converted back to Python objects. This causes a massive performance degradation, with roughly the same performance as proxies. C-defined functions can be used outside of constraints without issue.
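One way to check whether a callable carries CPython bytecode at all (a heuristic sketch, not Timefold's actual detection logic) is to look for its __code__ attribute, which C-defined callables lack:

```python
def has_python_bytecode(fn):
    """Heuristic: C-defined callables carry no CPython bytecode to translate."""
    return hasattr(fn, '__code__')

def pure_python(x):
    return x + 1

print(has_python_bytecode(pure_python))  # True  -- has bytecode
print(has_python_bytecode(len))          # False -- C-defined builtin
print(has_python_bytecode(sum))          # False -- C-defined builtin
```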
Conclusion
To realize Timefold’s goal of making PlanningAI accessible to everyone, everyone should be able to write in the language of their choice. For many, that language is Python.
To make writing PlanningAI models in Python practical, the Python solver must match a reasonable fraction of the Java solver’s performance. This is a challenge, as Python puts a myriad of obstacles in the way of performance. But like any obstacles, they can be overcome.
By overcoming these obstacles, we managed to make the Python solver match 25% of the Java solver’s performance, making writing PlanningAI models in Python practical. That being said, for use cases that truly require the most performant code, it is probably best to stick to Java.