CPU, temperature, memory, and disk usage
Because the operating system must be able to shift resources between
applications, CPU, memory, and disk usage should never be close to 100 percent.
Here you can see how the green CPU usage line suddenly drops and then climbs
again. If CPU usage fluctuates widely, identify the processes that trigger the
swings and check whether they could combine into a perfect storm that demands
more than 100% of the processing capacity.
SUGGESTED BEST PRACTICES:
• Total CPU <80% - Processes will spike to 100% for short
periods of time, so keep your average CPU usage low enough to leave headroom.
• Total Memory <70% - Similar to CPU usage, all
programs need swap space and the ability to allocate additional chunks of
memory.
• Total Disk <75% - SSDs are so cheap now that you
should not be getting anywhere near full. If the disk reaches 95%, you should
disable the robot's ability to move or interact, because processes can start
to die seemingly at random due to file I/O failures.
• Process-level CPU <60% of an individual core - Unless
you have a specific algorithm that cannot be split, most architectures
should allow the work to be decoupled across multiple ROS nodes or separate
processes that interact with each other.
• Process-level memory <25% - RAM is cheap these days and
you can redesign algorithms to work in very small spaces. Most algorithms should
be able to run in less than 1 GB, leaving most of the RAM for other uses.
• CPU temperature typically <60 °C - It can reach a maximum
of 80 °C, but should normally be well below that. If the processor gets too
hot, its clock speed is reduced, and processes that previously fit may need
more than 100% of the available CPU time. If the temperature rises over time,
you should improve cooling. This matters especially for outdoor robots, where
the housing can become an oven: the internal temperature can reach 60 °C or
higher even without the added heat of the CPU, so the robot can easily
overheat. Indoor robots without ventilation can have the same problem.
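
The guidelines above can be collected into a small table and checked against a
snapshot of readings. The following is a minimal sketch rather than a
definitive implementation: the `LIMITS` table, the metric names, and
`check_limits` are hypothetical names chosen for illustration, and the readings
are assumed to come from whatever monitoring source you already use.

```python
# Hypothetical limit table built from the guidelines listed above.
# The keys and function name are illustrative assumptions, not a library API.
LIMITS = {
    "total_cpu_percent": 80.0,       # average across all cores
    "total_memory_percent": 70.0,
    "total_disk_percent": 75.0,
    "process_cpu_percent": 60.0,     # share of a single core
    "process_memory_percent": 25.0,
    "cpu_temperature_c": 60.0,
}


def check_limits(readings, limits=LIMITS):
    """Return {metric: (value, limit)} for every reading at or above its limit.

    Metrics missing from `readings` are skipped rather than treated as failures.
    """
    violations = {}
    for metric, limit in limits.items():
        value = readings.get(metric)
        if value is not None and value >= limit:
            violations[metric] = (value, limit)
    return violations


if __name__ == "__main__":
    sample = {
        "total_cpu_percent": 42.0,
        "total_memory_percent": 73.5,   # over the 70% guideline
        "total_disk_percent": 51.0,
        "cpu_temperature_c": 58.0,
    }
    for metric, (value, limit) in check_limits(sample).items():
        print(f"{metric}: {value} is over the guideline of {limit}")
```

Keeping the limits in one table makes it easy to tune them per robot without
touching the checking logic.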
1. Create a resource "guardian" dead-man switch for the robot
If any of the maximum-usage guidelines above is out of
range, disable motion and other safety-critical interactions on the robot and
report it in an alert. If any of these basic resources runs out, overall
performance can collapse and the number of system errors can skyrocket, to the
point where you cannot even report a problem or maintain control of the system.
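
One way to sketch such a guardian is a small latching switch: once any limit is
exceeded, motion stays disabled until someone explicitly resets it. Everything
here is an assumption made for illustration, including the `ResourceGuardian`
class, the metric names, and the latching behaviour itself. How motion is
actually disabled depends on your robot and is left out.

```python
class ResourceGuardian:
    """Latching dead-man switch driven by resource readings (illustrative only)."""

    def __init__(self, limits, on_alert):
        self.limits = limits          # {metric name: maximum allowed value}
        self.on_alert = on_alert      # callable that receives a message string
        self.motion_enabled = True

    def update(self, readings):
        """Check one snapshot of readings; return True if motion is still allowed."""
        exceeded = [
            f"{name}={readings[name]} (limit {limit})"
            for name, limit in self.limits.items()
            if name in readings and readings[name] >= limit
        ]
        # Alert only on the transition, so a long outage does not flood alerts.
        if exceeded and self.motion_enabled:
            self.motion_enabled = False
            self.on_alert("Motion disabled: " + ", ".join(exceeded))
        return self.motion_enabled

    def reset(self):
        """Re-enable motion once the underlying problem has been dealt with."""
        self.motion_enabled = True


if __name__ == "__main__":
    guardian = ResourceGuardian(
        {"disk_percent": 95.0, "cpu_temperature_c": 80.0}, on_alert=print
    )
    guardian.update({"disk_percent": 60.0, "cpu_temperature_c": 55.0})
    guardian.update({"disk_percent": 96.0, "cpu_temperature_c": 55.0})
    print("motion enabled:", guardian.motion_enabled)
```

A stricter variant would also trip when readings stop arriving, since a
starved system may be unable to report anything at all.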
2. Find slow memory leaks with multi-day charts
As you zoom out in resource monitoring, you can look for
memory usage that ramps up over time to identify leaks, and correlate each ramp
with the specific process that caused it by expanding the details of that
process. `top` is a good place to start, but it doesn't clearly show how things
change over time. In the Freedom Resource Monitor tab, you can expand any
process that uses more than 1% of the CPU or RAM on your computer.
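
If you export per-process memory samples, a slow leak shows up as a steady
positive slope. The sketch below fits a least-squares line to each process's
samples and flags the ones that grow faster than a threshold. The function
names, the data layout, and the 10 MB per day threshold are assumptions made
for illustration.

```python
def memory_slope_mb_per_day(samples):
    """Least-squares slope of memory usage, in MB per day.

    `samples` is a list of (timestamp_seconds, rss_megabytes) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("all samples share the same timestamp")
    return covariance / variance * 86400.0  # per second -> per day


def find_leak_suspects(history, min_mb_per_day=10.0):
    """Return {process name: slope} for processes growing faster than the threshold."""
    suspects = {}
    for name, samples in history.items():
        slope = memory_slope_mb_per_day(samples)
        if slope >= min_mb_per_day:
            suspects[name] = slope
    return suspects


if __name__ == "__main__":
    hour = 3600
    # Three days of hourly samples for two made-up processes.
    history = {
        "planner": [(i * hour, 300.0 + i * 1.0) for i in range(72)],
        "camera_driver": [(i * hour, 200.0) for i in range(72)],
    }
    for name, slope in find_leak_suspects(history).items():
        print(f"{name} grows by about {slope:.1f} MB per day")
```

A fitted slope is less sensitive to short bursts than comparing the first and
last sample, which is why it suits multi-day data.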
3. Check for PID changes to spot process restarts
If the PID of a process (e.g., a ROS node) keeps changing
over time, the process is restarting, and this is usually caused by a crash.
Most of the time this is not noticeable because the process restarts
automatically, but the point at which it failed usually hides a resource
failure or a code exception. In the resource monitor, you can see the exact
start and stop time of each process.
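
The same check can be run on your own logs. As a rough sketch, assume you
periodically record a snapshot that maps each process name to its current PID.
The `find_restarts` function and the snapshot layout are hypothetical, and the
sketch assumes only one instance of each process name runs at a time.

```python
def find_restarts(snapshots):
    """Report PID changes per process name across time-ordered snapshots.

    `snapshots` is a list of (timestamp, {process_name: pid}) pairs.
    Returns {process_name: [(timestamp, old_pid, new_pid), ...]}.
    """
    last_pid = {}
    restarts = {}
    for timestamp, pids in snapshots:
        for name, pid in pids.items():
            previous = last_pid.get(name)
            # A different PID under the same name means the process was restarted.
            if previous is not None and previous != pid:
                restarts.setdefault(name, []).append((timestamp, previous, pid))
            last_pid[name] = pid
    return restarts


if __name__ == "__main__":
    snapshots = [
        (0, {"nav_node": 100, "camera_node": 200}),
        (60, {"nav_node": 100, "camera_node": 200}),
        (120, {"nav_node": 4321, "camera_node": 200}),  # nav_node came back
    ]
    for name, events in find_restarts(snapshots).items():
        for timestamp, old_pid, new_pid in events:
            print(f"{name} restarted by t={timestamp}s: PID {old_pid} -> {new_pid}")
```

The timestamp of each change gives you a window to search in the logs for the
resource failure or exception that caused the restart.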
4. Zoom out ... a lot ... and squint at the data.
It might not sound scientific, but our brains are great
pattern matchers. In Resource Monitor, you can download multiple days of data
(although it might take a while!). This will allow you to start seeing
patterns: Does RAM or CPU oscillate or increase over time, and does it
correlate with different nodes, processes, or connectivity changes?
We have found background packages we had installed that use
the CPU heavily on a regular schedule, but only once a day. This can lead to
hidden crashes in the future.
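
Once the data is downloaded, a once-a-day pattern like this can also be
surfaced in code. The sketch below groups CPU spikes by hour of day and reports
the hours that spike on several different days. The function name, the
"3x the median" spike rule, and the two-day minimum are all assumptions chosen
for illustration, and timestamps are treated as UTC seconds.

```python
from statistics import median


def daily_spike_hours(samples, factor=3.0, min_days=2):
    """Find hours of the day in which CPU usage spikes on several different days.

    `samples` is a list of (timestamp_seconds, cpu_percent) pairs. A sample
    counts as a spike if it is at least `factor` times the overall median.
    Returns the sorted hours (0-23) that contain a spike on at least
    `min_days` distinct days.
    """
    if not samples:
        return []
    baseline = median(cpu for _, cpu in samples)
    # Floor the baseline at 1% so an idle machine does not flag every sample.
    threshold = factor * max(baseline, 1.0)
    days_by_hour = {}
    for timestamp, cpu in samples:
        if cpu >= threshold:
            day, second_of_day = divmod(int(timestamp), 86400)
            days_by_hour.setdefault(second_of_day // 3600, set()).add(day)
    return sorted(h for h, days in days_by_hour.items() if len(days) >= min_days)


if __name__ == "__main__":
    samples = []
    for day in range(3):
        for minute in range(0, 1440, 10):
            # Made-up data: a job that runs at 03:00 every day.
            cpu = 90.0 if minute == 180 else 10.0
            samples.append((day * 86400 + minute * 60, cpu))
    print("hours with recurring spikes:", daily_spike_hours(samples))
```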
5. Upgrade your compute and offload processes
Consider upgrading your Raspberry Pi to an NVIDIA Jetson, or
your NUC to a more powerful version. Many robots start with the cheapest and
weakest processor available. This may work fine for a while, but when your
resource averages start to approach their limits, there will be times when the
computer stops working due to spikes in usage that you can't see in the
averages.
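
To see how an average can hide spikes, compare the mean of a set of CPU samples
with its 99th percentile and maximum. This is a small illustrative sketch with
made-up numbers, and `summarize` is a hypothetical helper rather than part of
any monitoring tool.

```python
def summarize(cpu_samples):
    """Return the mean, 99th percentile, and maximum of CPU percentages."""
    if not cpu_samples:
        raise ValueError("no samples")
    ordered = sorted(cpu_samples)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integer arithmetic.
    rank = -(-99 * len(ordered) // 100)
    return {
        "mean": sum(ordered) / len(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # 95 calm samples and 5 saturated ones.
    stats = summarize([55.0] * 95 + [100.0] * 5)
    print(stats)
```

With these numbers the mean is 57.25%, comfortably under the 80% guideline,
while the 99th percentile and the maximum are both 100%. Tracking a high
percentile next to the average makes this kind of saturation visible.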