ROS 2 Alternative middleware report

As stated in an earlier Discourse post, the ROS 2 core team is developing an alternative middleware RMW alongside the existing DDS RMWs. The eventual goal is to create a Tier-1 RMW that will be shipped with ROS 2 in future releases. However, the short-term goal for Jazzy is to have a source-installable RMW that the community can download, compile, and try out for themselves.

But that is getting ahead of ourselves a bit. The first question is: which middleware should we build a new RMW around? To answer that, the ROS 2 core team spent July and August doing research and listening to feedback from the community. The result of that work, along with a decision on which middleware we are going to use, is now available here: 2023-09 ROS 2 RMW alternate.pdf (1.2 MB)

After reading the report, if you have questions or comments, please reply to this thread. We are happy to discuss the contents of the report.

35 Likes

Just pulling out the conclusion from that report, since, I’m sure, that’s what everyone is most interested in:

The research has concluded that Zenoh best meets the requirements, and will be chosen as an alternative middleware. Zenoh was also the most-recommended alternative by users.

12 Likes

DDS has been a pain point for us (and occasionally still is) when migrating to ROS2. Happy to see that we’re not alone!
Having a non-DDS RMW that’s just plug and play like ROS1 will really help ROS2 adoption. Roboticists in academia I’ve interacted with are hesitant to move to ROS2 because things didn’t work out of the box on their networks. These folks then advise on and build open source robotics packages for industry, so the word spreads.

3 Likes

You should add a spoiler alert!! :wink:

Extremely happy to see this conclusion, having been witness to numerous network-wide crashes from DDS multicast packet storms both in PickNik’s offices and in customers’ homes and labs. This will likely quickly become our de facto RMW at PickNik for customers who don’t have specific middleware requirements once it exists.

2 Likes

Zenoh is a wise choice, thank you.

I think @clalancette summarized things well in the report.

The DDS stack works well when it is carefully tuned and operated on a well-managed
network, as evidenced by the successful use of ROS 2 in mission-critical systems around
the world. The issues described in the previous sections are surmountable in any particular
deployment, but they often require expert application-specific DDS configuration.

I am going to request that this discussion stay on topic and not devolve into a discussion of everyone’s personal problems with configuring DDS. Our intent with the RMW layer has always been to have an abstraction layer where individuals can bring new middleware implementations to the table. We’re finally getting a chance to step back and make this a reality.

Having said that, we are really interested in hearing everyone’s opinions on this choice, and how you think we should roll this change out to the broader ROS community. If you want to volunteer to help test this new RMW, that would be great. :wink:

5 Likes

This is great news!

As long-time Zenoh users (and contributors to the experimental Zenoh RMW), the Migeran Team welcomes this decision.

We actually started a PoC implementation back in 2022 using a modified cbindgen to generate the Rust stubs for the RMW API. Still, unfortunately, we could not pursue it further due to the lack of funding.

I would be very interested in what implementation strategies the ROS2 core team is contemplating, because - in addition to adding Zenoh to the ROS2 stack - this decision also presents an opportunity to make Rust a first-class citizen in ROS2.

I would definitely advocate for creating a Safe Rust-based RMW API, which could be implemented by the Zenoh RMW. Then only a minimal unsafe Rust to C binding would be necessary to integrate with the rest of the system.

This strategy would have 3 benefits:

Why would anyone want to bypass the good old rcl + RMW APIs?

With the proposed solution, a node written in Rust could be implemented almost entirely in safe Rust code, where the compiler could reason about the whole stack, not just the application part. This will become increasingly important as ROS2 is used more and more in safety-critical systems. In such cases the Ferrocene (Ferrocene - Ferrous Systems) Rust toolchain could be used with this setup.

What do you think?

Kind regards,
Gergely

7 Likes

As the project lead for the Eclipse Zenoh Project and on behalf of all the committers and contributors I would like to thank (1) @clalancette and the rest of the team for the extremely thorough evaluation and (2) the ROS User community for the trust in Zenoh. We feel extremely honoured by the outcome of this evaluation. We equally feel the responsibility that comes with it. Looking forward to making this happen.

Thank you from the bottom of our hearts and that of our Blue Dragon :wink:


12 Likes

For the initial implementation, the goal is to slot Zenoh into the existing C API. That is, the fact that Zenoh is implemented in Rust is completely hidden behind the RMW layer. That will allow us to quickly iterate, and for users to easily be able to try it out alongside the existing supported RMW implementations.

As future work, we could definitely consider more of a Rust-based API like you are proposing.

3 Likes

I totally understand and support that you would first target the RMW API, as it is defined today.

My question is rather about the internal implementation, whether you intend to use the Zenoh C/C++ binding to implement the current RMW API or will you implement the RMW API in Rust (either using cbindgen or some other means).

In my opinion, based on the PoC we did earlier, going with Rust directly would be beneficial in terms of performance and the time required to develop the RMW implementation.

We looked at this quite a bit, and we’re going to go with the zenoh-c binding. It exposes almost all of the features necessary, and the team that is going to do this is far more familiar with C/C++ than Rust. From our prototyping, the hard part isn’t what language it is written in anyway; it is dealing with things like graph introspection, and how/whether to launch the zenohd router.

From an embedded perspective, this is a very nice decision. Rust support on most embedded platforms is relatively spotty, if it’s there at all.

From that same perspective: for “C API” based RMWs, there appears to be a trend to implement the main functionality in C++ and then wrap this in C to also offer a C API for the RMW.

Given there are still platforms which don’t come with up-to-date C++ compilers, doing it the other way around (like how rclcpp wraps rcl) would be very helpful to help scale ROS 2 to such platforms (so a _c RMW which gets wrapped, if needed/desired, by a _cpp implementation).

Zenoh seems to support it (perhaps with a detour via Zenoh-Pico), and having an RMW available for (severely) limited platforms with 100% native (wire) compatibility with the “desktop” version of the same RMW would be priceless.


Edit: I just found @esteve’s work on an rmw_zenoh_pico_cpp from some time ago. Seems to use Zenoh-Pico.


Edit 2: Zenoh-C is a wrapper around the Rust implementation of course. I’m not suggesting the use of Zenoh-C ‘solves’ anything directly, but migration between Zenoh-C and Zenoh-Pico seems to at least be mentioned/documented here, which might help maintain a Zenoh-Pico based RMW.

We looked at this quite a bit, and we’re going to go with the zenoh-c binding. It exposes almost all of the features necessary, and the team that is going to do this is far more familiar with C/C++ than Rust. From our prototyping, the hard part isn’t what language it is written in anyway; it is dealing with things like graph introspection, and how/whether to launch the zenohd router.

Is there a design document already available for review? There are many different ways to use Zenoh as a ROS backend, e.g. is it planned to be compatible with the Zenoh-DDS plugin or will it be an opt-in feature with a separate “optimized” mode?

I am also curious about why a node would have to launch the router. In our Zenoh designs, we handle the router as an infrastructure component that is handled by the deployment system (e.g. Docker Compose) and not by each ROS node.

I recently read in detail the report ZettaScale put out some time ago [1], and I have a question that seems overlooked both in that report and in your report above: Ye Ol’ Overhead

As we noted (@alsora & @mjcarroll & I) in a paper [2] discussing the Impact of ROS 2 Node Composition on Robotics Systems, the overhead of all the current middlewares is kind of absurd, and we see a shocking reduction in CPU / memory overhead just by throwing the nodes into the same component container (even without IPC). We’re talking numbers dropping on the order of 30% for each metric on both ARM and x86 when running a full robotics system including perception, planning, control, etc. That is about 1/3 of compute saved on the middleware side while still churning out all the algorithms, not yet accounting for things like high-res images or pointclouds.

I am curious whether some attention has been given to how Zenoh’s CPU / memory usage compares to Cyclone and Fast-DDS. Should we expect this high-overhead trend to continue, or potentially to improve? Will we still see the same speed-up from composing nodes, which is pseudo-required right now to make a practical system work?

[1] Comparing the Performance of Zenoh, MQTT, Kafka, and DDS · Zenoh - pub/sub, geo distributed storage, query
[2] https://arxiv.org/pdf/2305.09933.pdf

3 Likes

Not yet, no. We are still looking at the different ways to do things.

That’s not a primary goal now, though my understanding is that once we have a Zenoh network, we will be able to use the Zenoh-DDS bridge.

We definitely wouldn’t launch a router per-node. We are considering launching a router per context or per machine. The idea being that, by default, we would use the router for all discovery. On the local machine, it is overkill. But if you want to connect to a remote machine, then you enter the IP address into your Zenoh/ROS configuration, and you are then connected. In this respect, it would be much like ROS 1 was. This neatly sidesteps all of the issues people have with multicast UDP not working. Additionally, having the router around means that we can store additional graph data there.

But none of this is set in stone. We are still exploring what to do here, so if you have specific design ideas, we are happy to hear them.

In short: I don’t know.

The slightly longer version of this is that some of the overhead comes from the RMW implementation, some comes from the layering of rclcpp → rcl → rmw → rmw implementation, and some comes from rclcpp and the executors themselves. I don’t know how much of the overhead comes from each of those issues. This work will do nothing about the latter two; that is a separate effort. rmw_zenoh and Zenoh proper may have less overhead, but it is hard to tell right now.

And just to be totally clear; the goal for this first version in Jazzy is to have something that reasonably works (and hopefully passes all of the tests). Performance profiling/improvements are something to look forward to for the K-Turtle implementation.

There is something I have always missed in ROS: a guarantee that the publisher is actually connected to the network. For example, code like this will likely not work as expected in ROS:

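    // assumption: `node` is a std::shared_ptr<rclcpp::Node>; the message type is illustrative
    auto publisher = node->create_publisher<std_msgs::msg::String>("foo", 10);
    std_msgs::msg::String someMsg;
    // published before any subscriber was connected, so it may never be delivered
    publisher->publish(someMsg);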

It would be nice to have some kind of API, to figure out that the publisher was announced, and all available subscribers have connected. (I am aware of the subscriber count API, but in order to use this, you would need to have a priori knowledge about the number of subscribers.)
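For reference, the subscriber-count workaround mentioned in the parentheses can be sketched as a small polling helper. This is only an illustration, not an existing ROS API: `wait_for_subscribers` and `count_fn` are hypothetical names, and the helper still needs the caller to supply `min_count`, which is exactly the a priori knowledge the post objects to.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>

// Hypothetical helper: polls `count_fn` until it reports at least `min_count`
// matched subscribers or `timeout` elapses.
// Returns true if the threshold was reached in time, false on timeout.
template <typename CountFn>
bool wait_for_subscribers(
    CountFn count_fn, std::size_t min_count,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds poll_period = std::chrono::milliseconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (count_fn() < min_count) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // gave up: not enough subscribers appeared
    }
    std::this_thread::sleep_for(poll_period);
  }
  return true;
}
```

With rclcpp, `count_fn` could be a lambda that returns `publisher->get_subscription_count()`, and `publish()` would be called only after the helper returns true.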

Perhaps keep this in the back of your head while adding the new network layer...

@JM_ROS

I guess this is kind of off topic from the RMW implementation, but just out of curiosity:

It would be nice to have some kind of API, to figure out that the publisher was announced, and all available subscribers have connected.

Can we just use the ROS 2 graph? We can get node name, namespace, topic type, and endpoint information as well.

(I am aware of the subscriber count API, but in order to use this, you would need to have a priori knowledge about the number of subscribers.)

How can the publisher know all concerned subscriptions are ready w/o having a priori knowledge? Can you share an example of what you are looking for?

I think the ROS 2 graph provides good information for a ROS 2 connectivity map, but the problem for us was that it is neither synchronized nor event-driven.

This has been addressed by the RMW matched event in Iron and later.
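As a rough sketch of the event-driven pattern, the class below blocks until a callback reports enough matched subscribers, instead of polling the graph. It uses only the C++ standard library, and `MatchedWaiter`, `notify`, and `wait_for` are hypothetical names, not part of rclcpp or rmw. The assumption is that a matched-event callback would forward its current match count to `notify`.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical helper: tracks the latest matched-subscriber count and lets a
// caller block until that count reaches a threshold.
class MatchedWaiter
{
public:
  // Call from the matched-event callback with the current number of matches.
  void notify(std::size_t current_count)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      current_count_ = current_count;
    }
    cv_.notify_all();
  }

  // Blocks until at least `min_count` subscribers are matched or `timeout`
  // elapses. Returns true if the threshold was reached.
  bool wait_for(std::size_t min_count, std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [&] { return current_count_ >= min_count; });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::size_t current_count_ = 0;
};
```

To my understanding, on Iron and later the publisher's matched callback receives a status with a `current_count` field that could be passed to `notify`. The caller still has to decide what `min_count` should be, so this does not remove the need for a priori knowledge of the expected subscribers.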

I am interested in what we are missing here.

thanks,
Tomoya

Not sure if zenoh rmw will be similar to a roscore-driven ROS 1 network, but in ROS 1, the roscore knows about all participants, so it knows the number of subscribers waiting for a publisher to appear.