18 ways to stabilize the computational resources of robotics


TLDR - Below are various stories about how robots failed, along with specific things to validate in your robotic system to minimize hidden failures caused by resource issues. In most robotic systems, resource utilization is overwhelmingly overlooked, leading to a lot of headaches and lower overall performance.

“More than half of the engineers we surveyed did not know what their CPU usage looked like, but raised concerns about it.”

I will review a set of lessons learned over the past 3 years, along with design patterns and best practices for tuning high-quality, high-performance robotic software.

If a robot becomes unstable, it can lead to a host of failures, including the scariest (and one I've seen multiple times): the runaway robot. These best practices were built up over time through many experiments: first by trying to debug the code on the multiple robots we tested and on many of our clients' robots, then by realizing that we didn't know enough to be successful, and finally by designing systems or establishing design patterns to address those problems. Some of this information is ROS-centric, some is common sense, and much of it comes from having to dig through thousands of hours of Linux and robotics logs.

We've seen everything from runaway robots scaling walls because processes failed, to unstable algorithms that only failed under specific connectivity or lighting conditions, to unknown bugs that were easily fixed when we were able to view remotely logged data.

If you build or use robots, these tips and best practices can be lifesavers. They were for us!

Robot is still dead on Sunday

For about a month, we had a robotic system that never felt quite stable. It worked during the week, but we began to recognize a pattern: on most Monday mornings, it had shut down by the time we got to the office. At that point, we didn't have great resource debugging tools, so it took a while to identify the problem.

This robot runs a Python process whose memory consumption keeps increasing, indicating a memory leak (see the yellow line that rises over time). These kinds of subtle problems can be easy to identify and narrow down to a particular process by looking at resource charts over time.

This was due to a very slow memory leak that caused a small ROS node to shut down only after several days of activity. This node was marked as a "required" node, so its exit would shut down the entire ROS system and the robot would stop charging. Typically, detecting this would require introspecting the Python code with YAPPI or a similar tool, but once we were able to step back and visualize resource usage over time, it became very clear (see image) that memory usage was growing linearly.
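
The "linear growth" check described above can also be automated instead of read off a chart. The following is a minimal sketch, not the tooling we used: it assumes you already collect (timestamp, memory) samples per process from some source, and the names `memory_growth_rate`, `looks_like_leak` and the 1 MB per hour threshold are hypothetical choices for illustration.

```python
from statistics import mean


def memory_growth_rate(samples):
    """Return the least-squares slope of memory over time, in bytes per second.

    samples: list of (timestamp_seconds, memory_bytes) pairs for one process.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_mean = mean(t for t, _ in samples)
    m_mean = mean(m for _, m in samples)
    # Ordinary least squares: slope = cov(t, m) / var(t)
    num = sum((t - t_mean) * (m - m_mean) for t, m in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must not all be equal")
    return num / den


def looks_like_leak(samples, threshold_bytes_per_hour=1_000_000):
    """Flag a process whose memory trend exceeds the (assumed) threshold."""
    return memory_growth_rate(samples) * 3600 > threshold_bytes_per_hour


# Example: one sample per minute, growing by 50 kB each minute.
leaky = [(i * 60, 100_000_000 + i * 50_000) for i in range(100)]
steady = [(i * 60, 100_000_000) for i in range(100)]
print(looks_like_leak(leaky), looks_like_leak(steady))
```

A slope fitted over hours or days is far less sensitive to short spikes than comparing two individual readings, which matters for a leak slow enough to take a whole weekend to bring a node down.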

Robot climbing the walls

We once had a robot that seemed to go crazy and stopped responding to drive commands during demonstrations. This usually ended with the robot climbing up a wall or spinning in place until the emergency stop could be activated.

Over time, we realized that there was a WiFi dead zone in a location we were not expecting, combined with a low-level motor driver interlock error that only occurred when a disconnection happened during an active drive command.

Finally, by carefully monitoring network connectivity and spotting the correlation, we found and fixed the error, and the quality of the network improved rapidly.
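
One common guard against this class of failure is a command watchdog: if no fresh drive command arrives within a timeout, the robot is commanded to stop rather than continuing with the last command. The sketch below is illustrative only and is not the fix we applied to the motor driver; the class name `CommandWatchdog`, the 0.5 second timeout, and the (linear, angular) command shape are all assumptions.

```python
import time


class CommandWatchdog:
    """Replace stale drive commands with a stop command."""

    STOP = (0.0, 0.0)

    def __init__(self, timeout_s=0.5, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_cmd = self.STOP
        self._last_cmd_time = None

    def on_command(self, linear, angular):
        """Call whenever a drive command arrives over the network."""
        self._last_cmd = (linear, angular)
        self._last_cmd_time = self._clock()

    def output(self):
        """Call from the motor control loop; returns the command to apply."""
        if self._last_cmd_time is None:
            return self.STOP  # nothing received yet
        if self._clock() - self._last_cmd_time > self._timeout_s:
            return self.STOP  # link dropped or sender stalled
        return self._last_cmd


# Example with a fake clock standing in for real time.
now = [0.0]
watchdog = CommandWatchdog(timeout_s=0.5, clock=lambda: now[0])
watchdog.on_command(0.4, 0.1)
now[0] = 0.3
print(watchdog.output())  # command is still fresh
now[0] = 1.0
print(watchdog.output())  # command has gone stale, so stop
```

The check runs on the robot, next to the motor loop, so it keeps working when the network does not.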

Safety note: We also recommend adding an IMU to all mobile robots, so that if they tilt too far in any direction (or fall over) they automatically perform a safety stop. Also, although lidar is not a guarantee of safety, it is still worth having a simple speed limiter so that the robot does not drive into objects that appear near it.
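
As a rough illustration of both recommendations, here is a minimal sketch that combines a tilt check with a distance-based speed limiter. All thresholds (20 degrees of tilt, 0.3 m stop distance, 1.5 m slow-down distance) and function names are hypothetical, and it assumes roll and pitch from the IMU and the nearest lidar range are already available as plain numbers. It is not a substitute for a certified safety system.

```python
import math


def tilt_exceeded(roll_rad, pitch_rad, max_tilt_rad=math.radians(20.0)):
    """True if the robot leans past the allowed angle on either axis."""
    return abs(roll_rad) > max_tilt_rad or abs(pitch_rad) > max_tilt_rad


def limit_speed(requested_mps, nearest_obstacle_m,
                stop_distance_m=0.3, slow_distance_m=1.5):
    """Scale the requested speed down linearly as an obstacle gets closer."""
    if nearest_obstacle_m <= stop_distance_m:
        return 0.0
    if nearest_obstacle_m >= slow_distance_m:
        return requested_mps
    scale = (nearest_obstacle_m - stop_distance_m) / (slow_distance_m - stop_distance_m)
    return requested_mps * scale


def safe_speed(requested_mps, roll_rad, pitch_rad, nearest_obstacle_m):
    """Apply the tilt stop first, then the obstacle-based limiter."""
    if tilt_exceeded(roll_rad, pitch_rad):
        return 0.0
    return limit_speed(requested_mps, nearest_obstacle_m)


# Example: level robot with an obstacle 0.9 m away, then a tilted robot.
print(safe_speed(1.0, 0.0, 0.0, 0.9))
print(safe_speed(1.0, math.radians(30.0), 0.0, 5.0))
```

Keeping the two checks as small pure functions makes them easy to test offline against logged IMU and lidar data before they ever run on a robot.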
