Deadlocks in rclpy and how to prevent them with use of callback groups

Deadlocks in rclpy and how to prevent them with use of callback groups

Our recent migration of Karelics Brain from ROS 1 to ROS 2 had our whole team learning a lot of new things in a relatively short time period. As is usually the case in such situations, this learning process did not happen without its struggles and hiccups. One of the things that gave us trouble on a couple of occasions was deadlocks occurring when calling ROS services or actions in our nodes.

It took quite a few tries and plenty of studying but we did eventually manage to figure out what was causing the deadlocks and, also, how we could avoid them. Moreover, we believe that properly grasping this topic helped us get a better understanding of the whole philosophy of ROS 2. Thus, we decided to compile our findings into a short guide, hoping that it will be useful to others in the ROS community dealing with similar issues.

As a result, we present to you our brief cookbook on handling callback groups in rclpy and using them properly in order to avoid deadlocks. For those interested in C++ side of ROS 2, we give the following note: While we have not verified the minimal examples presented in this post with rclcpp, by our knowledge, our main points also hold there.

Without further ado, let’s get going.

Table of Contents

Basics of executors and callback groups

For the sake of completeness, let us start with some basic definitions and concepts related to executors and callback groups in rclpy. Feel free to skip this section or parts of it if you are already familiar with the topics.

Executors in rclpy

By default, rclpy offers two different executors for the user to choose from:

  • SingleThreadedExecutor (default)
  • MultiThreadedExecutor

SingleThreadedExecutor is quite straightforward: It executes callbacks in a single thread, one at a time, and thus the previous callback must always finish before a new one can begin execution.

MultiThreadedExecutor, on the other hand, is capable of executing several callbacks simultaneously. While the presence of GIL in Python means that it is still not able to utilize truly concurrent execution with multiple CPU cores, it does provide the often useful possibility of having different callback executions overlap with each other and thus also allows waiting in a callback, something that SingleThreadedExecutor simply cannot achieve. Here, one should note the wording “is capable of,” hinting that the parallel execution is not a given (more on that in the following sections).

SingleThreadedExecutor doesn’t really care about any callback group options. Thus we will not discuss it further, and the following topics only concern the MultiThreadedExecutor.

Callback groups for MultiThreadedExecutors

Rclpy offers two different callback groups to work with in combination with MultiThreadedExecutors:

  • MutuallyExclusiveCallbackGroup
  • ReentrantCallbackGroup

MutuallyExclusiveCallbackGroup allows the executor to execute only one of its callbacks simultaneously, essentially making it as if the callbacks in the group were executed by a SingleThreadedExecutor. Thus, it is a good choice to put any callbacks accessing critical and potentially non-thread-safe resources in the same MutuallyExclusiveCallbackGroup.

ReentrantCallbackGroup allows the executor to schedule and execute the group’s callbacks in any way the executor sees fit, without restrictions. This means that, in addition to different callbacks being run simultaneously, the executor can also execute different instances of the same callback simultaneously. An example case of this would be if the execution of a timer callback takes longer than the timer’s firing period (although that particular case should be avoided at all costs in rclpy, but that’s another story).

Using callback groups to control execution and avoid deadlocks

Before we continue, it is important to understand a couple things about callbacks from rclpy’s perspective. In the context of rclpy and executors, a callback means a function whose scheduling and execution is handled by an executor. Some examples of callbacks in this context are

  • subscription callbacks (receiving and handling data from a topic),
  • timer callbacks,
  • service callbacks (for executing service requests in a server),
  • different callbacks in action servers and clients,
  • done-callbacks of Futures.

Below are a couple important points about callbacks that should be kept in mind when working with callback groups.

  • Almost everything in ROS 2 is a callback! Every function that is run by an executor is, by definition, a callback. The non-callback functions in a ROS 2 system are found mainly at the edge of the system (user and sensor inputs etc).
  • Sometimes the callbacks are hidden and their presence may not be obvious from the user/developer API provided in rclpy. This is the case especially with any kind of “synchronous” call to a service or an action. For example, the synchronous call Client.call(request) to a service adds a Future’s done-callback that needs to be executed during the execution of the function call, but this callback is not directly visible to the user.

Controlling execution with callback groups

In order to control execution with callback groups, one can consider the following guidelines (by no means an all-inclusive list!).

  • Register callbacks accessing critical non-thread-safe resources in the same MutuallyExclusiveCallbackGroup (or protect the resources by locks manually).
  • If you have a callback whose execution instances need to be able to overlap with each other, register it to a ReentrantCallbackGroup.
  • If you have callbacks that require to be potentially executed in parallel to one another, register them to
    • a ReentrantCallbackGroup, or
    • different MutuallyExclusiveCallbackGroups (this option is good if you want the callbacks to not overlap themselves, or also need or want thread safety with respect to some other callbacks).

Note that the last point in the list is a valid way of allowing parallel execution for different callbacks, and can even be more desirable than simply registering everything into one ReentrantCallbackGroup.

Avoiding deadlocks

Every ROS(2) developer knows that making synchronous calls to services or actions in a callback is bad and can lead to deadlocks… Except that this is not quite the whole truth. While using asynchronous calls (and consequently explicitly registering done-callbacks to futures) is indeed safer and allows things to work as intended even with SingleThreadedExecutors, synchronous calls can also be made to work in callbacks as long as the node’s callback group setup is done correctly. The obvious benefit of synchronous calls is that they make the code look cleaner and easier to understand, so let us see how to make them work without risk of deadlocks.

First thing to note here is that every node’s default callback group is a MutuallyExclusiveCallbackGroup. If the user does not specify any other callback group when creating a timer, subscription, client etc., any callbacks created then or later by these entities will use the node’s default callback group. Furthermore, if everything in a node uses the same MutuallyExclusiveCallbackGroup, that node essentially acts as if it was handled by a SingleThreadedExecutor, even if a multi-threaded one is specified! Thus, whenever one decides to use a MultiThreadedExecutor, some callback groups should always be specified in order for the executor choice to make sense.

With the above in mind, here are a couple guidelines to help avoid deadlocks:

  • Use asynchronous calls (if you always use only these everywhere, there should never be a deadlock).
  • If you make a synchronous call in any type of a callback, this callback and the client making the call need to belong to
    • different callback groups (of any type), or
    • a ReentrantCallbackGroup.

Failing the latter point will always cause a deadlock. An example of such a case would be making a synchronous service call in a timer callback (see the next section).

Examples

Let us look at some simple but hopefully enlightening examples about different callback group setups. The following demo code considers calling a service synchronously in a timer callback.

You can find a Github link to the full script here. Feel free to check it out, try it out yourself and expand it for further experimentation. You could, for example, add a publisher to the service node (perhaps publish when the service is called), subscribe to it in the client node, maybe add some sleeps and see what happens with different callback groups. As a curiosity, try making the subscriber never receive anything even though the service and timer are working.

Let us first go through the main parts of the code. We have a node providing a simple mock service:

class ServiceNode(Node):
    def __init__(self):
        super().__init__('mock_service_node')
        self.srv = self.create_service(Empty, 'test_service', callback=self.service_callback)

    def service_callback(self, request, result):
        self.get_logger().info('Server received request')
        return result

and a node with a client to the above service, along with a timer (unless we specify our desire for more direct control):

class CallbackGroupDemo(Node):
    def __init__(self, client_cb_croup, timer_cb_group, manual_calls):
        super().__init__('callback_group_demo_node')
        self.client = self.create_client(Empty, 'test_service', callback_group=client_cb_croup)
        if not manual_calls:
            self.call_timer = self.create_timer(1, self.timer_cb, callback_group=timer_cb_group)

    def call_srv(self, delay: float = 1):
        sleep(delay)
        self._call_srv()

    def _call_srv(self):
        self.get_logger().info('Client sending request')
        _ = self.client.call(Empty.Request())
        self.get_logger().info('Client received response')

    def timer_cb(self):
        self._call_srv()

We then have a couple helper functions for spinning the service node and making “manual” calls from outside of the executor-realm:

def spin_srv(executor):
    try:
        executor.spin()
    except rclpy.executors.ExternalShutdownException:
        pass

def call_srv_manually(client_node):
    client_node.call_srv()
    client_node.get_logger().info('Test finished successfully.\n')

And finally the actual demo script, taking the setup arguments: callback groups for the client and timer and whether we want to send service calls manually instead of using the timer. Here, we create the above nodes, assign them to executors (MultiThreadedExecutor for the demo node) and make them spin. Finally, we clean up.

def run_test(client_cb_group, timer_cb_group, manual_call):
    rclpy.init()

    node = CallbackGroupDemo(client_cb_croup=client_cb_group, timer_cb_group=timer_cb_group, manual_calls=manual_call)
    executor = MultiThreadedExecutor()
    executor.add_node(node)

    service_node = ServiceNode()
    srv_executor = SingleThreadedExecutor()
    srv_executor.add_node(service_node)
    srv_thread = Thread(target=spin_srv, args=(srv_executor, ), daemon=True)
    srv_thread.start()

    call_thread = Thread(target=call_srv_manually, args=(node, ), daemon=True)
    if manual_call:
        call_thread.start()

    try:
        print("")
        node.get_logger().info('Beginning demo, end with CTRL-C')
        executor.spin()
    except KeyboardInterrupt:
        node.get_logger().info('KeyboardInterrupt, shutting down.\n')
    node.destroy_node()
    service_node.destroy_node()
    rclpy.shutdown()
    try:
        srv_thread.join()
    except KeyboardInterrupt:
        pass    
    if manual_call: 
        call_thread.join()

Note that setting either of the callback group options to None will cause the respective callback(s) to be assigned into the node’s default MutuallyExclusiveCallbackGroup.

Let us then look at what happens when running the test with different options. We start by just using the default callback group for everything. We see that if we call the service manually (not from a callback), everything works: run_test(client_cb_group=None, timer_cb_group=None, manual_call=True) outputs

[INFO] [1649233901.626448480] [callback_group_demo_node]: Beginning demo, end with CTRL-C
[INFO] [1649233902.622004759] [callback_group_demo_node]: Client sending request
[INFO] [1649233902.624205200] [mock_service_node]: Server received request
[INFO] [1649233902.627408435] [callback_group_demo_node]: Client received response
[INFO] [1649233902.627800491] [callback_group_demo_node]: Test finished successfully.
^C[INFO] [1649233905.490485451] [callback_group_demo_node]: KeyboardInterrupt, shutting down.

This is because there is only one callback running on the client side: the (hidden) done-callback of the get_result_future, so it cannot be blocked by anything. But, the situation changes if we try using the timer for the service calls: run_test(client_cb_group=None, timer_cb_group=None, manual_call=False) outputs

[INFO] [1649233905.498086234] [callback_group_demo_node]: Beginning demo, end with CTRL-C
[INFO] [1649233906.497072443] [callback_group_demo_node]: Client sending request
[INFO] [1649233906.497673382] [mock_service_node]: Server received request ← DEADLOCK
^C[INFO] [1649233909.797191396] [callback_group_demo_node]: KeyboardInterrupt, shutting down.

The deadlock is caused by the timer callback and the above-mentioned done-callback being in the same (node’s default) MutuallyExclusiveCallbackGroup. Because the timer callback blocks execution until the result of the service call is received, the done-callback never gets to execute, and consequently the service call never completes.

So how do we solve the above issue? If we insist on using the synchronous call, we have two options: separate the timer and service callbacks to different callback groups (of any type) or put them into one ReentrantCallbackGroup. In the former case, we can replace either or both of the groups by ones set by ourselves. Thus all of the following test cases give the same output:

run_test(client_cb_group=MutuallyExclusiveCallbackGroup(), timer_cb_group=None, manual_call=False) 

run_test(client_cb_group=None, timer_cb_group=MutuallyExclusiveCallbackGroup(), manual_call=False)  

group1 = MutuallyExclusiveCallbackGroup()
group2 = MutuallyExclusiveCallbackGroup()
run_test(client_cb_group=group1, timer_cb_group=group2, manual_call=False)

cb_group = ReentrantCallbackGroup()

run_test(client_cb_group=cb_group, timer_cb_group=cb_group, manual_call=False)

Output:

[INFO] [1649233909.812419539] [callback_group_demo_node]: Beginning demo, end with CTRL-C
[INFO] [1649233910.812076113] [callback_group_demo_node]: Client sending request
[INFO] [1649233910.812990099] [mock_service_node]: Server received request
[INFO] [1649233910.815389239] [callback_group_demo_node]: Client received response
[INFO] [1649233911.811726650] [callback_group_demo_node]: Client sending request
[INFO] [1649233911.812350167] [mock_service_node]: Server received request
[INFO] [1649233911.813846297] [callback_group_demo_node]: Client received response
^C[INFO] [1649233912.312435736] [callback_group_demo_node]: KeyboardInterrupt, shutting down.

As you can see, the timer also now keeps firing repeatedly (as expected), whereas previously the first execution got stuck and blocked any further executions of the timer callback.

The last demo case is to see what happens if we replace both callback groups with the same MutuallyExclusiveCallbackGroup (one that’s different from the node’s default one). The result is the same as when using the default group:

cb_group = MutuallyExclusiveCallbackGroup()
run_test(client_cb_group=cb_group, timer_cb_group=cb_group, manual_call=False)

outputs

[INFO] [1649233905.498086234] [callback_group_demo_node]: Beginning demo, end with CTRL-C
[INFO] [1649233906.497072443] [callback_group_demo_node]: Client sending request
[INFO] [1649233906.497673382] [mock_service_node]: Server received request ← DEADLOCK
^C[INFO] [1649233909.797191396] [callback_group_demo_node]: KeyboardInterrupt, shutting down.

The reason why this didn’t work is that the future’s done-callback – which is the critical callback getting blocked by the timer callback – is assigned to the future by the service client. Thus, that callback will use the same callback group as the client does. If it would use the node’s default one instead, things would work, at least in this particular case.

When we used two different MutuallyExclusiveCallbackGroups, it worked because the timer callback (the one being blocked by the service call) was in a different group than the client (who relayed its callback group to the Future’s done-callback). Thus, the done-callback was able to execute parallel to the timer callback, the client returned the result of the service call, and the timer callback was able to finish (and execute again the next time the timer fired)!

Epilogue

We hope that this post will be helpful to anyone struggling to find a correct callback group setup to make their ROS 2 system work efficiently and without hiccups. In particular we wish for deadlock-free development for the whole of the ROS 2 community going forward!

Link to the full example script

Facebook
Twitter
LinkedIn
Picture of Karelics team
Karelics team

Karelics is a robotics software company located in Joensuu. We develop software to allow smart autonomous robots in the construction industry. Our ambition is to allow robots to work with people intelligently and safely. …or without people.

All Posts