How not to do cluster monitoring.

The world should have blinkenlights on its computer systems. That’s a given.

I wrote a couple of things. One was a Python program that pinged the four machines forming the cluster and displayed a red or green light on a Unicorn HAT HD to show their status. It worked very nicely. Then I wrote Python code, to be included in any program run in parallel on the cluster, which would send a signal saying whether each core was busy or not. That worked nicely too, and I now had a row of 16 LEDs, in red or green, so I could see what was going on. It was very pretty.
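For anyone wanting to try the ping half, here is a rough sketch of the idea, not the original program. The hostnames are made up, the ping flags are the Linux ones, and the colours are plain RGB tuples that would still need handing to whatever library drives the LEDs.

```python
import subprocess

# Hypothetical hostnames; the real ones are not given in this post.
HOSTS = ["oyster1", "oyster2", "oyster3", "oyster4"]

RED = (255, 0, 0)
GREEN = (0, 255, 0)


def is_up(host, timeout_s=1):
    """Return True if one ping to `host` succeeds (Linux-style ping flags)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ping itself is missing or could not be started
        return False
    return result.returncode == 0


def status_colours(hosts, check=is_up):
    """Map each host to a green (up) or red (down) RGB tuple."""
    return [GREEN if check(host) else RED for host in hosts]
```

Calling status_colours(HOSTS) gives one colour per machine, in order. The check argument is only there so the mapping can be tried out without a real network.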

Unfortunately, as it worked by sending a file by FTP every time a processor core changed between running and idle, it created a very effective Denial of Service attack on our network. Oops.

Now that I have thought about it more carefully, I shall be constructing a much better monitoring system, which will be based on sockets. I’ve been avoiding learning how to use them for far too long, anyway…
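To give a feel for why sockets should be so much lighter than an FTP transfer per state change, here is a minimal sketch of one possible approach: each status change becomes a single small UDP datagram. The message layout (host, core number, busy flag, separated by commas) and all the function names are my own assumptions for illustration, not a finished design.

```python
import socket


def encode_status(host, core, busy):
    """Pack one status change into a few bytes, e.g. b'oyster1,3,1'."""
    return f"{host},{core},{1 if busy else 0}".encode("ascii")


def decode_status(data):
    """Unpack a datagram back into (host, core, busy)."""
    host, core, busy = data.decode("ascii").split(",")
    return host, int(core), busy == "1"


def send_status(sock, addr, host, core, busy):
    """Send one status datagram to the monitor at `addr` (ip, port)."""
    sock.sendto(encode_status(host, core, busy), addr)
```

A worker would create one socket.socket(socket.AF_INET, socket.SOCK_DGRAM) at start-up and call send_status whenever a core changes state. UDP gives no delivery guarantee, so a lost packet leaves an LED showing stale data until the next change; that seems a fair trade for blinkenlights, but it is a design choice rather than a given.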

Later:

I tried at least umpteen example programs using sockets, and the connections were all rejected, and I couldn’t work out how to fix that. Suggestions, anyone?

Using a Python program to query the cluster computers took nearly six seconds to look at the 16 cores, hardly blinkenlights… A quick hack of a bash script, astonishingly, took almost as long. Back to trying to get sockets to work, then…

Working sockets tutorial!

At last, I found a socket programming example that worked, here!

I wanted to give Zan a tiny donation, but sadly his GoFundMe page seems defunct, and possibly the message I tried to send him also failed…

Sadly, I was then unable to work out how to accept multiple connections from the cluster computers.

Threading sockets programs!

There’s another set of client-server demos on GitHub, here, that I tested with Marvin and two of the Oysters, to confirm that it can do what I want. I can hoik code from those while retaining the program logic, and maybe get all four Oysters to send their status to Marvin, for him to display. I am not at all bothered that I am writing control system code for the cluster, instead of getting round to some fun applications of parallelism.
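This is not the code from those demos, but the general shape of a threaded server can be sketched with Python's standard socketserver module, which starts one thread per connection. The line format (host,core,state), the port number and the names below are all assumptions made for the example.

```python
import socketserver
import threading

# Latest known state per (host, core); shared between handler threads.
status = {}
status_lock = threading.Lock()


class StatusHandler(socketserver.StreamRequestHandler):
    """Reads lines like 'oyster1,3,1' from one client until it disconnects."""

    def handle(self):
        for raw in self.rfile:
            line = raw.decode("ascii", errors="replace").strip()
            if not line:
                continue
            try:
                host, core, state = line.split(",")
                key = (host, int(core))
            except ValueError:
                continue  # ignore malformed lines
            with status_lock:
                status[key] = (state == "1")


class StatusServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True   # allow quick restarts on the same port
    daemon_threads = True        # handler threads die with the main program


def start_server(bind_addr="0.0.0.0", port=5000):
    """Start the server in a background thread and return it."""
    server = StatusServer((bind_addr, port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Binding to 0.0.0.0 rather than localhost is what lets other machines connect at all; a server bound only to 127.0.0.1 will refuse them. Whether that explains my earlier rejected connections, I can't say for sure.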

Success!

I now have a GUI program that runs on Marvin, which takes a program developed on Marvin, deploys it on the four Pis that make up the Oyster cluster, then uses mpirun to run it in parallel on 16 cores, with the results appearing on Marvin.
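Stripped of the GUI, the deploy-and-run step could look something like the sketch below. It is a guess at the shape rather than the actual program: it assumes the job is a Python script, that it is copied with scp, and that mpirun takes Open MPI's --hostfile option (other MPI implementations spell this differently). Hostnames, paths and function names are hypothetical.

```python
import os
import posixpath
import subprocess


def copy_command(program, host, dest_dir):
    """Build the scp command that copies `program` to one node."""
    return ["scp", program, f"{host}:{dest_dir}/"]


def mpirun_command(remote_program, hostfile, n_procs=16):
    """Build the mpirun command (Open MPI style --hostfile assumed)."""
    return ["mpirun", "-np", str(n_procs), "--hostfile", hostfile,
            "python3", remote_program]


def deploy_and_run(program, hosts, dest_dir, hostfile, run=subprocess.run):
    """Copy the program to every node, then launch it across the cluster."""
    for host in hosts:
        run(copy_command(program, host, dest_dir), check=True)
    remote = posixpath.join(dest_dir, os.path.basename(program))
    # Capture the output so the caller can display the results.
    return run(mpirun_command(remote, hostfile),
               check=True, capture_output=True, text=True)
```

The result's stdout attribute holds whatever the parallel program printed. The run argument defaults to subprocess.run and exists only so the command-building can be exercised without a cluster to hand.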

It’s clearly time to knock off and celebrate…