ClusterInfo 1.3 - by Carl W. Bell
Copyright © 2011-2013 Baylor University. All Rights Reserved.
SSH "ASKPASS" authentication based on
by Ira Cooke.
Copyright © 2009 Mudflat Software. All rights reserved.
Disclaimer: Although ClusterInfo seems to work fine, it is distributed "as is." Use at your own risk.
Note: If you got here using a web search engine, you can find the program here.
ClusterInfo is a Mac OS X application that queries the "head/admin node" of a compute cluster for information on jobs running on the compute nodes. Specifically, it runs the Torque Resource Manager commands qstat and pbsnodes on the remote system, receives and parses the responses, and then displays the info for the user. The end result is somewhat similar to the "xpbsmon" command. ClusterInfo might work with OpenPBS or PBS Pro, but I haven't tested those. Other resource managers or batch systems won't work. Note that ClusterInfo only displays info about the compute jobs and nodes. It is not intended to be used to modify queues or nodes, submit/delete jobs, etc. although this functionality could be added in the future.
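As a rough illustration of the "receives and parses the responses" step, the Python sketch below turns XML job output into a list of dictionaries. It is not ClusterInfo's own code. The sample text and the element names (Data, Job, Job_Id, Job_Owner, job_state) are assumptions modelled on what Torque's "qstat -x" typically returns, so check them against your own server's output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real "qstat -x" output carries many more fields per job.
SAMPLE = """<Data>
  <Job><Job_Id>101.head</Job_Id><Job_Owner>alice@head</Job_Owner><job_state>R</job_state></Job>
  <Job><Job_Id>102.head</Job_Id><Job_Owner>bob@head</Job_Owner><job_state>Q</job_state></Job>
</Data>"""

def parse_jobs(xml_text):
    """Return one dict per <Job> element, keyed by the child element's tag name."""
    root = ET.fromstring(xml_text)
    return [{child.tag: (child.text or "") for child in job}
            for job in root.iter("Job")]

jobs = parse_jobs(SAMPLE)
print([(job["Job_Id"], job["job_state"]) for job in jobs])
```

Keeping every child element in the dictionary, rather than a fixed set of fields, matches the idea that the server can return more information than the main list shows.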
Internally, ClusterInfo uses the "ssh" command to execute the qstat and pbsnodes commands on the remote system. If you have ssh key pairs set up already, it should automagically use them. If not, ClusterInfo uses the ssh "SSH_ASKPASS" mechanism for authentication. ClusterInfo runs the commands via ssh; when ssh needs the user's remote password, it runs a separate program (actually located within ClusterInfo itself) that asks for the password. After you enter it or cancel, ssh continues normally. When entering the password, you can also indicate if you want ClusterInfo to store the password in your keychain. (See the "Keychain Access" application in your Utilities folder.)
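The ASKPASS arrangement can be sketched in Python as follows. This is only an illustration, not ClusterInfo's implementation: the function name and the helper path are hypothetical. It relies on the behavior described in the OpenSSH manual, where ssh runs the program named in SSH_ASKPASS when it has no terminal and DISPLAY is set.

```python
import os

def build_askpass_env(askpass_helper, base_env=None):
    """Return an environment that lets ssh ask a helper program for the password.

    askpass_helper is a hypothetical path to an executable that prints the
    password on stdout.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SSH_ASKPASS"] = askpass_helper
    # ssh traditionally consults SSH_ASKPASS only when DISPLAY is also set.
    env.setdefault("DISPLAY", ":0")
    return env

env = build_askpass_env("/path/to/askpass-helper", base_env={"PATH": "/usr/bin"})
```

To use it, the environment would be passed to something like subprocess.run(["ssh", ...], env=env, stdin=subprocess.DEVNULL, start_new_session=True), so that ssh has no controlling terminal and falls back to the helper.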
You will need to set up information for at least one remote server. Select the "Preferences" menu to open the "ClusterInfo Preferences" window and click the '+' button located below the "Servers" list. Here you enter your username on the remote server as well as the hostname (or IP address) of the remote server. You can also specify an optional description. The description is useful if you have multiple entries for the same server and user.
You can also modify the qstat and pbsnodes commands here, e.g., if they are located in a directory not in your $PATH, or if you want to specify other options. Note that these commands are run verbatim on the remote server, so it is possible to specify "/bin/delete_all_my_files" for the qstat command and it will run it. Note that ClusterInfo will not allow you to specify "root" as the user. If you do modify the commands, be sure to keep the "-x" options, which tell qstat and pbsnodes to return data as XML.
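A minimal Python sketch of how such a command line might be assembled is shown below. The function name, user, and host are hypothetical. The "root" refusal mirrors the rule described above; the check for "-x" is an extra safeguard added for this example only, not something ClusterInfo is stated to do.

```python
import shlex

def build_ssh_command(user, host, remote_command):
    """Build the argument list for running one command on the head node via ssh."""
    if user == "root":
        raise ValueError("refusing to run as root")
    # Assumption of this sketch: reject commands that dropped the XML option.
    if "-x" not in shlex.split(remote_command):
        raise ValueError("remote command must keep the -x (XML output) option")
    # The command string is passed through unchanged and run by the remote shell.
    return ["ssh", f"{user}@{host}", remote_command]

cmd = build_ssh_command("alice", "cluster.example.edu", "/usr/local/bin/qstat -x")
print(cmd)
```

Because the command string is passed through unchanged, anything typed into the command field runs on the server with the user's own privileges.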
ClusterInfo also allows you to run commands on the compute nodes. Essentially, it is running "ssh user@host ssh node command". There is no way to include a password when doing this. The cluster is probably set up so that you don't need to enter a password to ssh to the compute nodes. But if not, this feature won't work. Some commands, e.g., top, may require a tty, and so they probably won't work when run via ssh. You can add/delete/edit commands the same way as Servers above. When running a command on a compute node, the string "%u" is replaced with the current server's user.
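The "%u" substitution and the nested "ssh user@host ssh node command" form can be illustrated with a short Python sketch. The function name and the example user, host, and node names are hypothetical.

```python
def node_command(user, host, node, command_template):
    """Expand %u, then wrap the command as 'ssh user@host ssh node command'."""
    command = command_template.replace("%u", user)
    # The inner "ssh node command" is itself the remote command run on the head node.
    return ["ssh", f"{user}@{host}", f"ssh {node} {command}"]

args = node_command("alice", "cluster.example.edu", "node042", "ps -u %u")
print(args)
```

Here a command template such as "ps -u %u" becomes "ps -u alice" before it is sent to the compute node.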
Once you have set up your server info, you should be able to select it in the ClusterInfo window. Click the "Update" button to query the remote server. You may or may not have to enter your password for the remote server. If all goes well, you should get a list of jobs in the window. The Jobs/Nodes/Users tabs allow you to switch between qstat, pbsnodes, and "users" info. The actual output returned from the server is located in the "Output" window. This also includes stderr, so if there are problems, you may be able to use this to help fix things.
When running a command, e.g., updating, on the server, the "Update" button changes to "Cancel". Click the button to cancel the command. This should kill the command on the server. ClusterInfo will change the button to "Cancelling" and disable the button for 1 second before changing it back to "Update". This is to keep you from accidentally double-clicking the button and re-running the update.
ClusterInfo is not meant to be a "monitoring" tool so it does not periodically/automatically update the lists. If you want more current info, click the "Update" button again. The "Clear" button simply clears the lists and output text. If, for some reason, you want to temporarily disable using credentials stored in your keychain, hold down the option-key while pressing the "Update" button. You may need to do this if ClusterInfo returns "Permission denied" when attempting to update or run a remote command, e.g., you changed your password on the server but ClusterInfo is trying to use your old password that is still in the keychain.
Selecting an item in the Jobs, Nodes, or Users list will display the details for that job, node, or user in the "Selected Details" window. You can also display an item's details in individual (and persistent) windows by clicking the "Details" button.
An entry in the Jobs list includes the node(s) that the job is running on and the user who owns that job. Click the "Show Nodes" button and ClusterInfo will switch to the Nodes list but only show those nodes for that job. Click the "Show Users" button to switch to the Users list and only show the owner of that job. There are similar buttons for the Nodes and Users lists. If ClusterInfo is displaying a sub-set of jobs, nodes, or users, you can click the "Show All Jobs/Nodes/Users" button to get the full list.
Entries in the Jobs list have a colored dot that indicates their status. Green dots are running jobs; yellow dots are complete (or exiting) jobs; red dots are queued jobs. Similarly, entries in the Nodes list have a colored dot that indicates the number of running processes on a node. Green dots mean that there are no jobs running at all. Red dots mean that the number of jobs running on the node is equal to (or greater than) the number of processors on that node. Yellow dots mean there are jobs running on the node, but the number is less than the max. Note that these jobs may be multi-threaded and might actually be using all of the processors even though the number of jobs is less than the max and the node shows a yellow dot. A node's "Load" cell will be colored yellow if the loadave is more than the expected number of job processes + 1, or red if the loadave is more than the number of processors/cores + 1. It's possible that a job has just finished and the loadave hasn't come down yet, but if the cell stays yellow or red for an extended time, it may be worth examining the node to see what processes are running.
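The color rules above can be written out as a small Python sketch. The function names are hypothetical. Two assumptions are made: the job states use Torque's single-letter codes (R running, C complete, E exiting, Q queued), and the "expected number of job processes" is taken to be the number of jobs running on the node. States the text does not mention get no dot here.

```python
def job_dot(state):
    """Dot color for a job, keyed by its single-letter state code."""
    return {"R": "green", "C": "yellow", "E": "yellow", "Q": "red"}.get(state)

def node_dot(running_jobs, processors):
    """Dot color for a node, based on how many jobs it runs."""
    if running_jobs == 0:
        return "green"
    if running_jobs >= processors:
        return "red"
    return "yellow"

def load_color(loadave, running_jobs, processors):
    """Color for the Load cell; None means the cell is not highlighted."""
    if loadave > processors + 1:
        return "red"
    if loadave > running_jobs + 1:
        return "yellow"
    return None

print(node_dot(3, 8), load_color(6.0, 4, 8))
```

The red test for the Load cell is checked first because a load above the processor count is the more serious of the two conditions.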
If you select an item in the Nodes list, you can run a command on that node. For example, if the loadavg is higher than normal on a particular node, you may want to run "ps aux" on that node to list the processes on the node. Select the command you want to run and press the "Run" button. This will run ssh on the head node to execute the command on the compute node, e.g., "ssh server ssh node command". As mentioned above, you cannot specify a password to use for the ssh command, so if your compute nodes aren't set up to run without entering a password, this feature will not work. This uses ssh, not rsh, to run the command on the compute node.
The Users tab displays user-specific information derived from both the qstat and pbsnodes commands. For example, you can see the number of jobs a user is running, or which nodes the user's job(s) are running on.
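One way to derive this kind of per-user summary is sketched below in Python. The job records, their "owner" and "nodes" keys, and the function name are all hypothetical; ClusterInfo's actual data structures are not described here.

```python
from collections import defaultdict

def summarize_users(jobs):
    """Group jobs by owner: count the jobs and collect the nodes they run on."""
    summary = defaultdict(lambda: {"jobs": 0, "nodes": set()})
    for job in jobs:
        entry = summary[job["owner"]]
        entry["jobs"] += 1
        entry["nodes"].update(job["nodes"])
    return dict(summary)

# Hypothetical job records, as they might look after parsing the server output.
jobs = [
    {"owner": "alice", "nodes": ["node01", "node02"]},
    {"owner": "alice", "nodes": ["node02"]},
    {"owner": "bob", "nodes": []},
]
users = summarize_users(jobs)
print(users["alice"]["jobs"], sorted(users["alice"]["nodes"]))
```

A set is used for the nodes so that a node shared by several of a user's jobs is listed only once.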
The qstat and pbsnodes commands can return much more information than is normally displayed by ClusterInfo. You can see all of the info in the "Details" window. You can also choose to show/hide various columns using the "Choose Columns" menu. Select "Defaults" to return all the columns to their default widths/locations/hidden states. Holding down the option key allows you to show all the columns.
If you have any questions, comments, (constructive) criticism, or bug reports,
you can contact me at the address(es) below.
Carl Bell's Web Page
Stuff I've Written
Carl W. Bell
Academic and Research Computing Services
Baylor University Electronic Library
One Bear Place #97148
Waco, TX 76798
Baylor's Boilerplate Fine Print
This software, data and/or documentation contain trade secrets and confidential information which are proprietary to Baylor University. Their use or disclosure in whole or in part without the express written permission of Baylor University is prohibited.
This software, data and/or documentation are also unpublished works protected under the copyright laws of the United States of America. If these works become published, the following notice shall apply:
Copyright © 2011-2013 Baylor University
All Rights Reserved
The name of Baylor University may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE, DATA AND/OR DOCUMENTATION ARE PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
When permission has been granted to make copies of this software, data and/or documentation, the above notices must be retained on all copies.
Permission is hereby granted for non-commercial use and distribution of ClusterInfo