A few months back, the Karelics team embarked on the perilous journey of migrating from ROS 2 Galactic to the latest LTS distribution, Humble. Galactic had already been end-of-life (EOL) for half a year, so moving to Humble was a high priority if we wanted to keep receiving the latest updates from the ROS 2 ecosystem. The change also promised to improve the stability of our Brain product, as we were already running into plenty of issues that we had started brushing away and marking as “fixed in Humble”. In this blog post, we document the highs, the lows, and the WTFs we stumbled on while migrating our Karelics Brain to ROS 2 Humble.
High hopes and higher dreams: The journey begins
We decided to start by first migrating the navigation component over to Humble, as that would immediately enable some nice features and fixes to the issues we have been having. In-house written code was a breeze to migrate. We write most of our custom packages in Python, and those worked without much hassle in Humble, precisely as we had expected. We use the open-source Navigation2 (nav2) package to tackle the autonomous navigation task, and migrating that was not a problem.
At this point, we also decided to separate the navigation functionality and all its dependencies into a separate Docker container so that we could run it separately without worrying about other packages. To our surprise, the navigation container freshly built on top of Humble worked with the rest of our Brain packages, which were still running in Galactic and in a separate container. This helped immensely while testing, which allowed us to begin working on the navigation tuning and improvements (collision monitor!) with the new shiny things in nav2 even though the rest of the system was not migrated yet.
The first version of the nav2 container was running in no time! Still, something was missing: the STVL (Spatio-Temporal Voxel Layer), which we have been using for the costmap generation process. STVL does not have binary packages, and building it from source runs into a problem with the OpenVDB version bundled with Ubuntu 22.04. The problem is described in this issue, which is also where we found our “band-aid” solution. We hereby want to send our thanks and regards to xouillet for sharing the binaries of OpenVDB with the fix. With those, we got the STVL to build in our container and were thus able to move forward.
After the small speedbump encountered with STVL, we still had one feature missing, and the reason for that was a small mistake we made in the past. We had contributed to the STVL package by adding the implementation for ClearCostmapExceptRegion. The contribution was made way back in Foxy, and we cherry-picked it into our fork for Galactic, but we forgot to create a PR to get the change into the Rolling branch and ensure its support for the future. So we first had to port our contribution to Humble, which we did, and we aim to get it merged into the original repo soon if it is accepted.
Fewer headaches than expected: Other Brain packages and robot-specific drivers
The rest of our Brain code was almost as easy to migrate as the navigation parts were. The C++ code needed a couple of include changes here and there, because headers have been moving to the `.hpp` extension instead of `.h`. After a while, most nodes were building and even starting as expected. Then two of the nodes suddenly started crashing with a segmentation fault. After spending several hours brainstorming and trying things, we created a minimal Docker container that included only the crashing node, so it would be easy to pass it to other developers in our team for extra help. That container had only one problem. The code ran perfectly… or, well, as well as that piece of code could run. We encountered no crashing, and the node handled the data correctly. After comparing the two containers and some trial and error, we identified the culprit. In our Dockerfile, we had specified the following line for STVL, as that fixed some issues with Ubuntu 20.04:
We had overlooked that line during the migration, and keeping it caused issues with memory allocation in the C++ code. Deleting the line fixed the problem, and all the crashes were gone.
Full speed ahead after the malloc issue, and we had most of Karelics Brain and the required robot drivers running in Humble on the Robauta Sampo robot. The following steps were to migrate the simulation to make it easier for our team to develop locally. After that, the one we expected to take a while was our Jetson container and all the packages running inside it.
Some of our robots live in a simulation. But boy, is it a slow one…
For the simulation environments, we started by trying out the new and shiny Gazebo Ignition, Ignition Gazebo, Gazebo… We wonder if anyone knows what name we should actually use for it, as it has been going back and forth. But as we here in Finland say, “Ei nimi miestä pahenna, jos ei mies nimeä” (name cannot judge people unless one makes themselves infamous). We tried the Fortress version, which is recommended for Humble. We installed it with apt-get and got the example to run in no time. We then moved on to launching our simulation world there. We had to fix a couple of paths to make the models visible, and we were rocking! Well, almost… Many of the obstacles we had created in that world were basic shapes from the classic Gazebo, chosen to keep that specific world light. Those blocks now all showed up as unit cubes. We also noticed that the simulation could not reach the simulation time step (sim_time) of 1.00, which we previously had in the classic Gazebo. The GUI was taking a massive chunk of the CPU power of the relatively high-specced laptop used during the process. That is not acceptable, because in some cases we run the simulation on lower-performing machines to test the communication between the Karelics Brain and Cloud.
After trying to get the simulation to run faster, trying out the headless mode, and adding GUI panels for things like the list of objects in the scene, we decided to try the Gazebo classic in Humble, so that we could get a version of the simulation out and get other colleagues properly started with development on Humble. Again, installation from binaries was simple, and we had the good old Gazebo classic running with our old configs just fine. But soon we ran into a new issue, caused by having moved the navigation into a separate container from the simulation. Whenever the simulation was restarted, the simulated time started again from zero, and this caused issues for our navigation container unless we restarted that one as well. This was fighting against one of the most significant benefits of dividing our software into smaller containers. We investigated this and discovered that one of the latest commits made to Gazebo Classic solves the issue by adding a parameter `initial_sim_time`. That change did not exist yet in the binaries, so we tried installing the whole thing from source. And well, we were not holding our breaths for a new release, as the latest one was over a year old! We spent a couple of hours trying to get Gazebo to build, but with no luck. Some dependencies gave us a hard time, and eventually we gave up on it, for now, to move forward with finishing the migration, since we still had one big mountain to climb: the Jetson container.
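To illustrate the restart problem, here is a minimal sketch of how a node could notice that simulated time has jumped backward, which is what a restarted simulation looks like from the outside. It is not what we run in the Brain and not how nav2 handles time. The class name `SimClockWatcher` and its `tolerance_s` field are made up for this example, and in a real node the timestamps would come from the `/clock` topic rather than from a list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimClockWatcher:
    """Tracks simulated timestamps and flags backward jumps (hypothetical helper)."""

    # Assumption: backward steps smaller than this many seconds are ignored.
    tolerance_s: float = 0.0
    last_s: Optional[float] = None
    resets: int = 0

    def update(self, stamp_s: float) -> bool:
        """Feed the newest simulated time; return True if it jumped backward."""
        jumped = (
            self.last_s is not None
            and stamp_s < self.last_s - self.tolerance_s
        )
        if jumped:
            self.resets += 1
        self.last_s = stamp_s
        return jumped


if __name__ == "__main__":
    watcher = SimClockWatcher()
    # The fourth stamp imitates a simulation restart: time starts over at zero.
    for stamp in [0.0, 0.5, 1.0, 0.0, 0.5]:
        if watcher.update(stamp):
            print(f"simulated time went back to {stamp}; reset time-dependent state")
```

A check like this only detects the reset. Anything that caches timestamps, such as transform buffers, still has to be cleared or restarted afterwards, which is why a start time that continues from where it left off is the nicer fix.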
Migrating the Jetson container: more powerful – more headaches
We are also working with Nvidia’s Jetson line of products. For example, we have been running Nvidia’s visual slam implementation on the Jetson. We began by installing a new JetPack version on the Jetson, since the Humble version of the visual slam requires JetPack 5.1.1. Installing JetPack was relatively quick and straightforward. But that was the point where the simple stuff ended. We spent over two weeks working on the Jetson and getting all the software to run. The first hurdle was to get the Intel RealSense drivers to work. They do not yet officially support the new JetPack versions (5+), which caused some problems. Finally, we learned from one of the issues that the development branch should now support the new JetPack. We jumped on that, and there it was! We got the cameras to run! Even two cameras worked with the RealSense viewer (pretty neat!). The next step was to get the ROS camera driver to run. The new Humble version brought a couple of new parameters, while some of the old ones had been removed. The only problem was that the documentation in the repo had not been updated to match the changes. Luckily, there was a changelog deeper in the repo, which included the information on the changed parameters.
The rest of the Jetson migration went quite OK. Visual Slam had some new parameters that had to be set for it to work in our system, and the rest of the new and exciting parameters were left to be fine-tuned later. One annoying thing while working on the Jetson was a bug in the docker container dustynv/ros:humble-pytorch-l4t-r35.3.1. Almost whenever we stopped echoing a topic or stopped a node etc., the whole terminal exited from the docker container. Well, you know what they say, good things in life never come alone!
Aftermath: Cold hard truths are always learned after a ROS distro migration
While moving towards dividing our system into multiple containers, we learned the importance of ALWAYS keeping the dependencies in the `package.xml` up-to-date. With Python nodes, this is a painful experience, because a missing dependency only shows up at runtime. You first build the Dockerfile, which contains the packages that you think are required and only those. After building, you try to run your software, but suddenly a dependency is missing. You add the missing package to the image, rebuild, and try to run again, only to see that you somehow also depend on a third package that was not listed, one that even makes you wonder why on earth you depend on it.
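One way to catch this earlier is to compare what the Python sources import with what `package.xml` declares. The sketch below is only an illustration of that idea, not a tool we use: the function names are made up, it reads only `<depend>` and `<exec_depend>` entries, and it assumes that the import name equals the dependency name. That holds for most ROS packages but not for every system library (for example, `cv2` is provided by the rosdep key `python3-opencv`), so treat the output as a list of things to check by hand.

```python
import ast
import xml.etree.ElementTree as ET

# Tags that declare runtime dependencies in package.xml.
RUNTIME_TAGS = ("depend", "exec_depend")


def declared_dependencies(package_xml: str) -> set:
    """Return the names listed in the runtime dependency tags."""
    root = ET.fromstring(package_xml.strip())
    return {
        element.text.strip()
        for tag in RUNTIME_TAGS
        for element in root.iter(tag)
        if element.text
    }


def imported_modules(source: str) -> set:
    """Return the top-level module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports inside the package itself.
            modules.add(node.module.split(".")[0])
    return modules


def undeclared_imports(package_xml: str, sources, ignore=frozenset()) -> set:
    """Imports that appear in the sources but not in package.xml."""
    used = set()
    for source in sources:
        used |= imported_modules(source)
    return used - declared_dependencies(package_xml) - set(ignore)


if __name__ == "__main__":
    # Hypothetical package with one declared dependency.
    example_xml = """
    <package format="3">
      <name>demo_node</name>
      <exec_depend>rclpy</exec_depend>
    </package>
    """
    example_source = "import os\nimport rclpy\nfrom nav_msgs.msg import Path\n"
    # Standard-library modules are passed in by hand to keep the sketch simple.
    print(undeclared_imports(example_xml, [example_source], ignore={"os"}))
```

Running a check like this before building the image will not replace an actual test run inside the container, but it shortens the build, run, fail, rebuild loop described above.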
Once again, we learned that keeping the documentation up-to-date will save time for us and possibly for others that end up using software we make if we publish something. Good documentation of the parameters makes it much easier for the developer to understand what the parameters do and will save plenty of time and nerves when you can trust documentation instead of diving into source code to figure out how the parameter affects the program.
We wouldn’t be entirely truthful if we said the whole process of migrating to ROS 2 Humble was one without hurdles and hiccups. It took a while, and we still have a couple of issues left to be sorted out, but it has already given us many benefits! Simply migrating solved a handful of open issues we had before. Many third-party packages offer a wide range of great new features in Humble branches, which are now already helping make our Karelics Brain even more powerful! As always, we have learned a lot from this process, and we are sure that in the future, migrating to new ROS 2 flavours will be easier and more streamlined due to our extensive experience and the great work put in by all the people involved with developing and maintaining the ROS ecosystem!