In today's world the Software Engineer often has to wear multiple hats, playing the roles of both developer and system administrator. If you are writing code that runs on a distributed platform like Apache Spark, that is even more true.
Say, for example, that someone tells you they think a server that is part of your platform is "going slow". When you log in to that Linux server, what tool can you use to check how the machine is performing under load?
top is a simple CLI tool that lets you interactively monitor the processes running on the OS.
To bring it up, just type top at the shell prompt.
Now, on the screen that greets you, check the load average section:
Say you do, and you see the following:

              last 1 min  last 5 min  last 15 min
load average:   44.16,      45.91,      46.07
Note: you won't see the "last 1 min", etc. labels in top; I added those for reference.
What does this mean? Well, first you need to know how many cores the system has.
The general rule is that if a load average is greater than the number of CPU cores on your system, then the system was overloaded for that time period!
I took the above load averages from a system with 16 cores, so it was VERY overloaded! Rumor has it that this machine actually exploded shortly after I recorded these numbers.
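You can apply the same rule from a script. As a minimal sketch (the function name is mine), Python's standard library exposes the very same three numbers via os.getloadavg() on Unix-like systems:

```python
import os

def load_status():
    """Compare the 1/5/15-minute load averages against the core count."""
    cores = os.cpu_count()
    one, five, fifteen = os.getloadavg()  # the same numbers top displays
    for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
        state = "overloaded" if value > cores else "ok"
        print(f"last {label}: load {value:.2f} on {cores} cores -> {state}")
    # True means the box was overloaded over the last minute
    return one > cores

load_status()
```

On the 16-core machine above, every load average exceeds 16, so all three periods would report "overloaded".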
Hit 1 to see the activity of all the cores
If you are on a multi-core machine, then inside top you can press 1 to see the activity of each individual core.
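On Linux, the per-core counters that top renders when you press 1 come from /proc/stat. As a rough sketch (the helper name is mine), you can pull out the per-core lines yourself:

```python
def per_core_lines(stat_text):
    """Return the per-core 'cpuN' lines from /proc/stat content,
    skipping the aggregate 'cpu' line and non-CPU counters."""
    return [line for line in stat_text.splitlines()
            if line.startswith("cpu") and line[3:4].isdigit()]

# On a Linux box you could feed it the real file:
# with open("/proc/stat") as f:
#     for line in per_core_lines(f.read()):
#         print(line)
```

Each cpuN line holds cumulative jiffies per state (user, system, idle, ...), which is what top diffs between refreshes to compute per-core percentages.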
Using top is not just for monitoring a single machine
Interestingly, top is still very helpful when working on a distributed system like Apache Spark. I have written Python scripts that log in to every node in the cluster and display its top output, so I can see exactly what is going on across the cluster in real time!
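A minimal sketch of that idea, assuming passwordless SSH to each node (the hostnames and helper names here are hypothetical): top's batch mode (-b) with a single iteration (-n 1) makes it scriptable instead of interactive.

```python
import subprocess

# Hypothetical node list -- replace with your cluster's real hostnames.
NODES = ["spark-worker-1", "spark-worker-2", "spark-worker-3"]

def top_command(host):
    """Build the ssh command that grabs one batch-mode top snapshot."""
    return ["ssh", host, "top", "-b", "-n", "1"]

def snapshot_cluster(nodes=NODES):
    """Run top once on every node and return {host: its top output}."""
    results = {}
    for host in nodes:
        proc = subprocess.run(top_command(host), capture_output=True, text=True)
        results[host] = proc.stdout
    return results
```

Loop this every few seconds and print the first dozen lines of each snapshot, and you get a crude cluster-wide top.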