18 ways to stabilize your robot's computational resources
TLDR - Below are several stories about how robots failed, along with specific things to validate in your robotic system to minimize hidden failures due to resource issues. In most robotic systems, resource management is overwhelmingly overlooked, leading to a lot of headaches and lower overall performance.
“More than half of the engineers we surveyed did not know what their CPU usage was like, but raised concerns about it.”
I will walk through a set of learnings from the past 3 years, along with design patterns and best practices for tuning high-quality, high-performance robotic software.
Image: a rover stuck on a fence.
If a robot becomes unstable, it can lead to a host of failures, including the scariest (and the one I've seen multiple times): the runaway robot. These best practices were built up over time through many experiments, debugging both the code in the multiple bots we tested and many of our clients' bots, repeatedly finding that we didn't know enough to be successful, and then designing systems or establishing design patterns to address those problems. Some of this information is ROS-centric, some is common sense, and much of it comes from having to comb through thousands of hours of Linux and robotics logs.
We've seen
everything from runaway robots scaling walls because processes failed, to
unstable algorithms that only failed under specific connectivity or lighting
conditions, to unknown bugs that were easily fixed when we were able to view
remotely logged data.
If you build or use robots, these tips and best practices can be a lifesaver. They were for us!
The robot that kept dying on Sundays
For about a month, we had a robotic system that never felt quite stable. It worked during the week, but we began to recognize a pattern: most Monday mornings, the robot had shut itself off by the time we got to the office. At this point, we didn't have great resource debugging tools, so it took a while to identify the problem.
This robot had a Python process whose memory consumption kept increasing, indicating a memory leak (see the yellow line that increases over time). These types of subtle problems become easy to identify and narrow down to a particular process by looking at resource charts over time.
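As a hedged illustration, here is a minimal sketch of the kind of per-process logging that makes such charts possible, using the psutil library. The watched process names, sampling interval, and CSV path are illustrative assumptions, not what our system actually used.

```python
# Minimal sketch: periodically log per-process memory (RSS) and CPU usage to
# a CSV file so it can be charted over time. The process-name filter and the
# one-minute interval are illustrative assumptions.
import csv
import time

import psutil

SAMPLE_INTERVAL_S = 60                  # one sample per minute is plenty for slow leaks
WATCHED_NAMES = {"python", "python3"}   # adjust to your nodes' process names

with open("resource_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "pid", "name", "rss_mb", "cpu_percent"])
    while True:
        now = time.time()
        for proc in psutil.process_iter(["pid", "name", "memory_info"]):
            try:
                if proc.info["name"] in WATCHED_NAMES:
                    rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
                    # Note: the first cpu_percent() call per process returns 0.0;
                    # subsequent samples are meaningful.
                    writer.writerow([now, proc.info["pid"], proc.info["name"],
                                     round(rss_mb, 1), proc.cpu_percent(interval=None)])
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue  # process exited or is protected; skip it
        f.flush()
        time.sleep(SAMPLE_INTERVAL_S)
```

Charting the rss_mb column per process over a few days is usually enough to make a slow leak stand out as a steadily climbing line.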
This was due to a very slow memory leak that caused a small ROS node to crash only after several days of activity. This node was marked as a "required" node, so its crash would shut down the entire ROS system, and the robot would stop charging. Typically, detecting this would require introspecting the Python code with a profiler such as yappi or a different tool, but when we were able to step back and visualize resource usage over time, the linear growth in memory usage became very clear (see image).
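As an example of that kind of code introspection, here is a minimal sketch using Python's built-in tracemalloc module to narrow a slow leak down to the allocating lines; the ten-minute snapshot interval and the number of entries printed are arbitrary choices for illustration.

```python
# Minimal sketch: snapshot-diffing with the standard-library tracemalloc
# module to find which lines of a long-running node keep allocating memory.
import time
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

while True:
    time.sleep(600)  # compare snapshots every 10 minutes
    snapshot = tracemalloc.take_snapshot()
    top_diffs = snapshot.compare_to(baseline, "lineno")
    print("Top allocation growth since baseline:")
    for stat in top_diffs[:5]:  # a leaking line usually floats to the top
        print(stat)
```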
Robot
climbing the walls
We once had a robot that seemed to go crazy and stopped responding to commands during demonstrations. This usually ended with the robot climbing up a wall or spinning in place until the emergency stop could be activated.
Over time, we realized that there was a WiFi dead zone in a location we were not expecting, combined with a low-level motor driver interlock bug that only occurred when a disconnect happened while a drive command was active.
Finally, by carefully monitoring network connectivity and detecting the correlation, the bug was found and fixed, and the network quality quickly improved.
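A minimal sketch of the kind of connectivity logging that can surface such a correlation; the gateway IP, one-second interval, and Linux ping flags are illustrative assumptions.

```python
# Minimal sketch: log WiFi reachability over time so connectivity drops can
# be correlated with robot misbehavior in the same time window.
import subprocess
import time

GATEWAY_IP = "192.168.1.1"  # replace with your router or base station
LOG_PATH = "wifi_log.csv"

with open(LOG_PATH, "a") as log:
    log.write("timestamp,reachable\n")
    while True:
        # `ping -c 1 -W 1` sends one packet with a 1-second timeout (Linux iputils).
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", GATEWAY_IP],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        log.write(f"{time.time()},{int(result.returncode == 0)}\n")
        log.flush()
        time.sleep(1.0)
```

Lining this log up against the robot's command and error logs is what makes an "only fails when disconnected while driving" bug visible.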
Safety note: we also recommend adding an IMU to all mobile robots, so that if they tilt too far in any direction (or fall) they automatically perform a safety stop; a minimal sketch follows. Also, although lidar is not a guarantee of safety, it is still worth having a simple speed limiter so that the robot does not drive into objects that appear near it.
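Here is a minimal sketch of such an IMU tilt watchdog for ROS 1 (rospy). The topic names, the 30-degree threshold, and stopping via cmd_vel are all assumptions for illustration; a real system should cut motor power at a lower level than a velocity topic.

```python
# Minimal sketch: stop the robot if IMU roll or pitch exceeds a threshold.
import math

import rospy
from geometry_msgs.msg import Twist
from sensor_msgs.msg import Imu

MAX_TILT_RAD = math.radians(30.0)  # illustrative limit

def imu_callback(msg):
    q = msg.orientation
    # Roll and pitch from the orientation quaternion (standard conversion).
    roll = math.atan2(2.0 * (q.w * q.x + q.y * q.z),
                      1.0 - 2.0 * (q.x * q.x + q.y * q.y))
    pitch = math.asin(max(-1.0, min(1.0, 2.0 * (q.w * q.y - q.z * q.x))))
    if abs(roll) > MAX_TILT_RAD or abs(pitch) > MAX_TILT_RAD:
        rospy.logerr("Tilt limit exceeded (roll=%.2f, pitch=%.2f): stopping",
                     roll, pitch)
        cmd_pub.publish(Twist())  # all-zero twist = stop command

rospy.init_node("tilt_watchdog")
cmd_pub = rospy.Publisher("cmd_vel", Twist, queue_size=1)
rospy.Subscriber("imu/data", Imu, imu_callback)
rospy.spin()
```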