In today's world, the software engineer often has to wear multiple hats, playing the role of both developer and system administrator. If you are writing code that runs on a distributed platform like Apache Spark, this is even more true.

Let's say, for example, that someone has told you that they think some server that is part of your platform is "going slow". When you log in to the Linux server, what tool can you use to check how that machine is performing under load?

Enter top

top is a simple CLI tool that lets you interactively monitor the processes running on the OS.

To bring it up, type:

top

Now, on the screen that greets you, check the load average section.

Say you do, and you see the following:

              last 1 min   last 5 min   last 15 min
load average: 44.16,       45.91,       46.07

Note: you won't see the "last 1 min" labels and friends; I added those for reference.
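If you just want those three numbers without the full top display, the uptime command prints the same load averages:

uptime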

What does this mean? Well, first you need to know how many cores are in the system.

Exit top (press q).

Run:

nproc

or

lscpu

Either command will tell you the core count (nproc prints just the number, while lscpu prints detailed CPU info). The general rule is that if the load averages are greater than the number of CPU cores on your system, then the system was overloaded for that time period!
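If you'd rather script this check than eyeball it, here is a minimal sketch in Python. It assumes a Linux machine, since it reads the same three numbers top shows from /proc/loadavg:

#!/usr/bin/env python3
# A sketch of the load-average rule: compare /proc/loadavg to the core count.
import os

# /proc/loadavg starts with the same three numbers top shows:
# the 1, 5, and 15 minute load averages.
with open("/proc/loadavg") as f:
    loads = [float(x) for x in f.read().split()[:3]]

cores = os.cpu_count()  # the same count nproc reports

for label, load in zip(("1 min", "5 min", "15 min"), loads):
    status = "overloaded" if load > cores else "ok"
    print(f"{label}: load {load:.2f} on {cores} cores -> {status}")

On the 16-core machine below, all three lines would have come out overloaded.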

The load averages above came from a system with 16 cores, so it was VERY overloaded: 44.16 / 16 ≈ 2.8, meaning the machine had almost three times as much runnable work as it had cores! Rumor has it that this machine actually exploded shortly after I recorded these numbers.

Hit 1 to see the activity of all the cores

If you are on a multi-core machine, then inside top you can press 1 to toggle between the single summary line and a per-core breakdown.

top is not just for monitoring a single machine

Interestingly, top is still very helpful when working on a distributed system like Apache Spark. I have written Python scripts that log in to every node in the cluster and display its top info, so I can see exactly what is going on across the cluster in real time!
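Here is a minimal sketch of that idea. The hostnames are made up, and it assumes passwordless SSH to each node; it grabs a one-shot batch-mode snapshot from top instead of the interactive display:

#!/usr/bin/env python3
# A sketch: fetch a batch-mode top snapshot from every node over SSH.
# Assumes passwordless SSH; the hostnames are hypothetical.
import subprocess

NODES = ["spark-worker-1", "spark-worker-2", "spark-worker-3"]

for node in NODES:
    # top -b -n 1 runs one non-interactive iteration; head keeps the
    # summary lines (including the load averages).
    result = subprocess.run(
        ["ssh", node, "top -b -n 1 | head -n 5"],
        capture_output=True, text=True, timeout=30,
    )
    print(f"===== {node} =====")
    print(result.stdout if result.returncode == 0 else f"ssh failed: {result.stderr}")

Run it in a loop (or under watch) and you get a poor man's cluster dashboard.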

