Handling monitoring probes

Bringing to the forum some discussion I started over mail to try to bring it to life :wink:

We have PROC07 for adding a new probe to the system, but we don’t have information on the following points (answers from @apaolini are quoted):

  • what is the environment where this probe should run?

well, you found that link in this page, https://wiki.egi.eu/wiki/SAM#Probes , which is deprecated, so we have to change the link in the procedure, for example using this one: https://wiki.egi.eu/wiki/ARGO#ARGO_tests

I cannot find similar up-to-date documentation for the probes. Have you already asked Emir, by chance?

  • what happens if the probe cannot be easily run on that environment (e.g. some cloud probes cannot run on CentOS 6)?

++ Kernel Panic!! the procedure doesn’t foresee that a proposed probe cannot be executed by the monitoring system because of an OS incompatibility… ++

When the procedure was written, this use case wasn’t taken into account, or maybe it was taken for granted that any probe would be compatible with the monitoring system.

Do we need to explicitly say that our monitoring system runs on CentOS 6 and that any probe has to be compatible with it? (keeping in mind that we cannot deploy a new/different monitoring system every time a probe is not compatible with the existing one)

Right now I don’t remember precisely what the problem with these cloud probes is and why they require CentOS 7 (or how happy Emir is about deploying an ad-hoc ARGO instance)

  • what is “the probe integration process” performed as part of step 6?

essentially it is what is described in this old page (integration and testing): http://argoeu.github.io/samdoc/confluence/display/SAMDOC/Probes%20Development.html considering that there is no more EMI and that SAM/ARGO is now a central service.

So if Emir can point us to an updated version of that guide, he can also update the information about the integration process

  • how is an existing probe updated? what can trigger such an update? what are the criteria to make sure that the probe update is ok?

an update to a probe can be triggered by any reason that justifies an improvement. The new version then goes through the same procedure again.

Moreover, in this other procedure https://wiki.egi.eu/wiki/PROC06 the probes are tested in production before inclusion in the operations or critical profiles. I see that it mentions the quality criteria that probes need to fulfil: https://documents.egi.eu/document/240 Perhaps they should be reviewed, since they date from 2011 and several pieces of information are deprecated.

@enolfc do you want these Q&A to show up as actual “solved” questions? We can split the topic up into a set of individual questions and mark the solution “solved”

OK, so what exactly are you expecting here @enolfc?

First, I would like to have clear documentation for those questions that can be used for any coming probe developer (e.g. myself wanting to do a probe for the EGI Notebooks)

And from there, see where things can be automated so we can push for a more agile process, or even automate the deployment of things.

If I’m not mistaken, this has been implemented?

@enolfc do you care to comment on your experience in writing this probe?

I’m always ambivalent about the packaging thing. The guide says that you need to distribute the probe as an OS package. I wonder if this is really necessary. The Nagios API, from what I understand, works on exit codes, so as long as there’s a runtime on the host, you can write your probe in any language (contrary to what ARGO’s documentation says, although admittedly that was written ~5 years ago).

Correct me if I’m wrong, but these probes are executed on the ARGO box itself, not on the service. So, if ARGO had a runtime for a specific language, or could execute containers (in which case there would be no need for this language constraint), we would have a much better time writing probes, right?
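To make the exit-code point concrete, here is a minimal sketch in Python of the contract a Nagios-style check follows: the numeric exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) carries the result, and one line of text on stdout carries the detail. The `run_check` helper and the shape of the `check` callable are hypothetical names for illustration only, not part of Nagios or ARGO.

```python
"""Sketch of the Nagios plugin contract: the exit code carries the result."""

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(check):
    """Run check() and return (exit_code, one-line message).

    `check` is a hypothetical callable returning (code, detail).
    Any exception is reported as UNKNOWN rather than crashing the probe.
    """
    try:
        code, detail = check()
    except Exception as exc:
        return UNKNOWN, f"UNKNOWN - {exc}"
    if code not in LABELS:
        return UNKNOWN, f"UNKNOWN - unexpected code {code!r}"
    return code, f"{LABELS[code]} - {detail}"
```

A real probe would print the message and then call `sys.exit(code)`. Nothing here depends on Python specifically; any language that can print a line and set its exit status can honour the same contract.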

I’m not saying we need to go full BINFMT_MISC or anything, but it would be nice to be able to whip up a development environment for probes.

I wonder how that is being done…

This probe is designed to overcome several limitations of the current monitoring tools in EGI :slight_smile:

As we can only do black-box testing from nagios (performing actual actions on the service from outside would probably be too complex, and would require authorising nagios credentials or creating ad-hoc tokens), I decided to add internal monitoring to the service that can perform the relevant actions there:

Now this tries to spawn a new server every hour and reports the status of the last trial at /services/status.

This allows us to refine the monitoring without changing argo/nagios stuff (and thus avoid any long, non-automated processes there) and to do actual operations that are meaningful to the status of the service.

Given that, all I needed to do from nagios was connect to the /services/status endpoint and report whatever is reported there. Easy to implement in almost any environment… which brings us to the topic you raised:
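As a rough illustration of the idea (not the actual code running in the service), the internal monitor can be thought of as something that runs one trial, remembers the outcome, and exposes it as a small document. Everything in this sketch is an assumption: the `LastTrial` class, the `spawn_server` callable and the JSON fields `status`, `message` and `timestamp` are hypothetical, and the hourly scheduling and the HTTP handler for /services/status are left out.

```python
import json
import time


class LastTrial:
    """Remembers the outcome of the most recent self-test (hypothetical helper)."""

    def __init__(self):
        self.result = {
            "status": "UNKNOWN",
            "message": "no trial run yet",
            "timestamp": None,
        }

    def run(self, spawn_server, now=time.time):
        """Call spawn_server() once and record whether it succeeded."""
        try:
            spawn_server()
            status, message = "OK", "server spawned"
        except Exception as exc:
            status, message = "CRITICAL", f"spawn failed: {exc}"
        self.result = {"status": status, "message": message, "timestamp": now()}
        return self.result

    def as_json(self):
        """Body that a /services/status handler could return."""
        return json.dumps(self.result)
```

The point of the design is that the logic deciding what "healthy" means lives inside the service, so it can change without touching anything on the monitoring side.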
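For illustration, a probe of this kind could look roughly like the sketch below, which uses only the Python standard library. It is not the packaged probe: the response format is an assumption (a JSON object with a `status` field and an optional `message` field), and the function names `interpret` and `probe` are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
CODES = {"OK": OK, "WARNING": WARNING, "CRITICAL": CRITICAL}


def interpret(body):
    """Translate the endpoint's JSON body (assumed format) into (exit_code, message)."""
    try:
        data = json.loads(body)
        status = str(data["status"]).upper()
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "UNKNOWN - unparseable status document"
    code = CODES.get(status, UNKNOWN)
    label = status if code != UNKNOWN else "UNKNOWN"
    return code, f"{label} - {data.get('message', 'no message')}"


def probe(url, timeout=10):
    """Fetch the status URL and relay what the service reports about itself."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        return CRITICAL, f"CRITICAL - cannot reach {url}: {exc}"
    return interpret(body)
```

A command-line wrapper would call `probe()` with the service's /services/status URL, print the message, and exit with the returned code. Treating an unreachable endpoint as CRITICAL and an unreadable document as UNKNOWN is a design choice of this sketch, not something the thread prescribes.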

We can’t have arbitrary probes on the current Nagios box because these boxes are stuck on CentOS 6 and only accept packages for probe distribution. That means: no docker, ancient python, ancient ruby, and 5-year-old documentation still being up to date :wink:

So my experience has been good, mostly because I have a lot of insider information and I’m deliberately avoiding the nagios box as much as I can.

I just heard about these Guidelines for Monitoring probes; they may provide some more info: https://docs.google.com/document/d/1fDqO0LPjRlX68D_jDm3ulxsc0J17XoZqAjMnjRhDfHI

Thanks for the link!

It seems now I need to update the packages to follow the guidelines :unamused:

There is also the following page: https://wiki.eosc-hub.eu/display/EOSC/ARGO+Guidelines+for+monitoring+probes

The probes are now updated to follow those guidelines. In the end it was just a matter of moving the binaries to the right location.

Now waiting for the ARGO team to add them: https://ggus.eu/?mode=ticket_info&ticket_id=136703

If you want, we can set a reminder on this topic to help us come back to it when the ARGO team is ready. Is there any idea of what the horizon for this is?

I want this to be ready asap, ideally before the next SSB.