An Introduction to the Portable Batch System (PBS)

 

1. Introduction

The Portable Batch System, PBS, is a batch job and computer system resource management package. It will accept batch jobs (shell scripts with control attributes), preserve and protect the job until it is run, run the job, and deliver output back to the submitter.

PBS may be installed and configured to support jobs run on a single system, or many systems grouped together. Because of the flexibility of PBS, the systems may be grouped in many fashions.

1.1 PBS features

  • platform availability:
      - Cray using Unicos 8, 9, 10 or MK2
      - IBM 590, IBM SP using AIX 4
      - Silicon Graphics systems using IRIX 5.x or 6.x
      - Sun Sparc using SunOS 4.1 or Solaris 2.5 (5.5)
      - AMD/Intel/Cyrix systems using Linux or FreeBSD

  • automatic load-leveling
  • file staging
  • job interdependency:
      - execution order
      - conditioned execution
      - synchronization

  • security and authorizations (ACL based): allow or deny access on a per-system, per-group, and/or per-user basis

  • username mapping
  • parallel jobs support
  • job accounting
  • provides a graphical user interface (xpbs, xpbsmon)
  • comprehensive API to:
      - write new commands
      - integrate PBS into applications
      - implement particular scheduler policies

1.2 Components of PBS

PBS consists of four major components: commands, the job Server, the job executor, and the job Scheduler. A brief description of each is given here.

Commands

PBS supplies both command-line and graphical interfaces. These are used to submit, monitor, modify, and delete jobs. The commands can be installed on any system type supported by PBS and do not require the local presence of any of the other components of PBS. There are three classifications of commands:

·   User commands: qsub, qstat, qdel, qselect, qrerun, qorder, qmove, qhold, qalter, qmsg, qrls

·   Operator commands: qenable, qdisable, qrun, qstart, qstop, qterm

·   Administrator commands: qmgr, pbsnodes

Operator and administrator commands require different access privileges.

Job Server

The Job Server is the central focus for PBS. Within this document, it is generally referred to as the Server or by the execution name pbs_server. All commands and the other daemons communicate with the Server via an IP network. The Server's main function is to provide the basic batch services such as receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job (placing it into execution).

One server manages one or more queues; a batch queue consists of a collection of zero or more batch jobs and a set of queue attributes. Jobs are said to reside in the queue or be members of the queue. In spite of the name, jobs residing in a queue need not be ordered first in, first out. Access to a queue is limited to the server which owns the queue. All clients gain information about a queue or jobs within a queue through batch requests to the server.

Two main types of queues are defined: routing queues and execution queues. When a job resides in an execution queue, it is a candidate for execution. A job in execution is still a member of the execution queue from which it was selected for execution. When a job resides in a routing queue, it is a candidate for routing to a new destination. Each routing queue has a list of destinations to which jobs may be routed. The new destination may be a different queue within the same server or a queue under a different server.

The Job Server must know the list of nodes that can execute jobs: they are declared in a file in the server private directory PBS_HOME/server_priv.

Job Executor

The job executor is the daemon which actually places the job into execution. This daemon, pbs_mom, is informally called Mom as it is the mother of all executing jobs. Mom places a job into execution when it receives a copy of the job from a Server. Mom creates a new session as identical to a user login session as is possible. For example, if the user's login shell is csh, then Mom creates a session in which .login is run as well as .cshrc. Mom also has the responsibility for returning the job's output to the user when directed to do so by the Server. There must be a Mom running on every node that can execute jobs.

Job Scheduler

The Job Scheduler is another daemon which contains the site's policy controlling which job is run and where and when it is run. Because each site has its own ideas about what is a good or effective policy, PBS allows each site to create its own Scheduler. When run, the Scheduler can communicate with the various Moms to learn about the state of system resources and with the Server to learn about the availability of jobs to execute. The interface to the Server is through the same API as the commands. In fact, the Scheduler just appears as a batch Manager to the Server.

In addition to the above major pieces, PBS also provides an Application Programming Interface (API), which is used by the commands to communicate with the Server. This API is described in the section 3 man pages furnished with PBS. A site may make use of the API to implement new commands if so desired.

1.3 Interactions between PBS components in a multiple hosts configuration

The batch system fits into a client-server model, with a batch client making a request of a batch server and the server replying. This client-server communication necessitates an interprocess communication method supportable over a network and a data exchange (data encoding) format. While the basic PBS system fits nicely into the client-server model, it also has aspects of a transaction system, so the batch system data exchange protocol has been built on top of a reliable stream connection protocol: TCP/IP and the socket interface to the network.

[Figure: how PBS components co-operate in a multiple-hosts configuration]

1.4 Jobs overview

The basic object that PBS manages is the job. Each job which is submitted to the system can:

  • be batch or interactive
  • define a list of required resources
  • define a priority
  • define the time of execution
  • send mail to the user when execution starts, ends, or aborts
  • define dependencies (after, afterOk, afterNotOk, before, ... )
  • be synchronized with other jobs
  • be check-pointed (if the host OS provides for it)
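
For example, a single qsub invocation can exercise several of these capabilities at once; the script name myjob.sh and the prerequisite job identifier 123 in this sketch are hypothetical:

# run no earlier than 22:30, with priority 10, mailing the user on
# abort/begin/end, and only after job 123 terminates without error
qsub -a 2230 -p 10 -m abe -W depend=afterok:123 myjob.sh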

1.5 Resources

Each submitted job can specify a list of resources required for its execution: the number and type of the specifiable resources are platform-dependent; here is an incomplete list of names that are common to all systems:

  • cput: max CPU time used by all processes in the job
  • pcput: max CPU time used by any single process in the job
  • mem: max amount of physical memory used by the job
  • pmem: max amount of physical memory used by any process of the job
  • vmem: max amount of virtual memory used by the job
  • pvmem: max amount of virtual memory used by any process of the job
  • walltime: max wall-clock time the job may run
  • file: the largest size of any single file that may be created by the job
  • host: name of the host on which job should be run
  • nodes: number and/or type of nodes to be reserved for exclusive use by the job
  • ...

For each resource it is possible to specify minimum/maximum limits and default values in queue and server attributes, in order to filter different classes of jobs. If a running job exceeds the amount of a resource it requested, it will be aborted by Mom.
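
For example, a job requiring one hour of CPU time, 256 megabytes of memory and two bi-processor nodes could be submitted as follows (the script name myjob.sh is illustrative):

qsub -l cput=1:00:00,mem=256mb,nodes=2:ppn=2 myjob.sh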

2. User commands

This section describes the commands available to the general user. See the corresponding man page for complete documentation. When more than one operand is specified on the command line, the command processes each operand in turn. An error reply from a server on one operand will be noted in the standard error stream. The command continues processing the other operands. If an error reply was received for any operand, the final exit status for the command will be greater than zero.

Besides the PBS commands you can use xpbs, which provides a user-friendly point-and-click interface to manage jobs, and xpbsmon, a GUI for displaying and monitoring the nodes/execution hosts under PBS.

2.1 Job identifiers

When the term job identifier is used, the identifier is specified as:

sequence_number[.server_name][@server]

The sequence_number is the number supplied by the server when the job was submitted. The server_name component is the name of the server which created the job. If it is missing, the name of the default server will be assumed. @server specifies the current location of the job.

When the term fully qualified job identifier is used, the identifier is specified as:

sequence_number.server[@server]

The @server suffix is not required if the job still resides at the original server which created the job. The qsub command will return a fully qualified job identifier.
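
For example (the server names are illustrative):

123                              job 123 at the default server
123.phobos.mydomain.com          fully qualified: job 123 created by server phobos
123.phobos.mydomain.com@deimos   the same job, currently located at server deimos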

2.2 Directing requests to correct server

A command performs its function by sending the corresponding request for service to a batch server. The choice of batch servers to which to send the request is governed by the following ordered set of rules:

  1. For those commands which require or accept a job identifier operand, if the server is specified in the job identifier operand as @server, then the batch requests will be sent to the server named by server.
  2. For those commands which require or accept a job identifier operand and the @server is not specified, then the command will attempt to determine the current location of the job by sending a Locate Job batch request to the server which created the job.
  3. If a server component of a destination is supplied via the -q option, such as on qsub and qselect, but not qalter, then the server request is sent to that server.
  4. The server request is sent to the server identified as the default server, (see the environment variable PBS_DEFAULT).
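
For instance, setting PBS_DEFAULT in the shell directs subsequent commands at a chosen server; the host name in this sketch is illustrative:

PBS_DEFAULT=phobos.mydomain.com
export PBS_DEFAULT
qstat      # now queries phobos.mydomain.com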

2.3 qalter - alter pbs batch job

qalter [-a date_time] [-A account_string] [-c interval] [-e path] [-h hold_list] [-j join] [-k keep] [-l resource_list] [-m mail_options] [-M user_list] [-N name] [-o path] [-p priority] [-r c] [-S path] [-u user_list] [-W additional_attributes] job_identifier...

The qalter command modifies the attributes of the job or jobs specified by job_identifier on the command line. Only those attributes listed as options on the command will be modified. If any of the specified attributes cannot be modified for a job for any reason, none of that job's attributes will be modified.
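
A hedged example (the job identifier is illustrative), raising the cput limit of a job and lowering its priority in one call:

qalter -l cput=2:00:00 -p -10 123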

2.4 qdel - delete pbs batch job

qdel [-W delay] job_identifier ...

The qdel command deletes jobs in the order in which their job identifiers are presented to the command. A job that has been deleted is no longer subject to management by batch services. A batch job may be deleted by its owner, the batch operator, or the batch administrator. A batch job being deleted by a server will be sent a SIGTERM signal followed by a SIGKILL signal. The time delay between the two signals is an attribute of the execution queue from which the job was run (settable by the administrator). This delay may be overridden by the -W option.
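
For example, to delete job 123 (identifier illustrative) allowing it 30 seconds between the SIGTERM and the SIGKILL:

qdel -W 30 123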

2.5 qhold - hold pbs batch jobs

qhold [-h hold_list] job_identifier ...

The qhold command requests that a server place one or more holds on a job. A job that has a hold is not eligible for execution. There are three supported holds: USER, OTHER (also known as operator), and SYSTEM.

A user may place a USER hold upon any job the user owns. An "operator", who is a user with "operator privilege", may place either a USER or an OTHER hold on any job. The batch administrator may place any hold on any job.

If no -h option is given, the USER hold will be applied to the jobs described by the job_identifier operand list.

If the job identified by job_identifier is in the queued, held, or waiting states, then all that occurs is that the hold type is added to the job. The job is then placed into held state if it resides in an execution queue.

If the job is in the running state, then the following additional action is taken to interrupt the execution of the job. If checkpoint/restart is supported by the host system, requesting a hold on a running job will:

  1. cause the job to be check-pointed,
  2. release the resources assigned to the job, and
  3. place the job in the held state in the execution queue.

If checkpoint/restart is not supported, qhold will only set the requested hold attribute. This will have no effect unless the job is rerun with the qrerun command.
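
For example, to place a USER hold on job 123 explicitly (equivalent to the default when -h is omitted; the identifier is illustrative):

qhold -h u 123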

2.6 qmove - move a pbs batch job to another queue

qmove destination job_identifier ...

To move a job is to remove the job from the queue in which it resides and instantiate the job in another queue. Destination can specify a queue, a server, or a specific queue at a server. A job in the Running, Transiting, or Exiting state cannot be moved.
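
For example, to move job 123 into the queue longq at server phobos (all names illustrative):

qmove longq@phobos 123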

2.7 qmsg - send a message into standard output/error of pbs batch jobs

qmsg [-E] [-O] message_string job_identifier ...

To send a message to a job is to write a message string into one or more output files of the job (standard output or standard error). Typically this is done to leave an informative message in the output of the job. The qmsg command writes messages into the files of jobs by sending a Message Job batch request. The qmsg command does not directly write the message into the files of the job: it only sends a request to the batch server that owns the job.
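
For example, to write a note into both the standard error (-E) and standard output (-O) files of job 123 (identifier and message illustrative):

qmsg -E -O "node maintenance scheduled, job may be rerun" 123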

2.8 qorder - exchange order of two pbs batch jobs in a queue

qorder job_identifier job_identifier

To order two jobs is to exchange the jobs' positions in the queue or queues in which the jobs reside. The two jobs must be located at the same server. No attribute of the job, such as priority, is changed. The impact of interchanging the order within the queue(s) is dependent on local job scheduling policy. A job in the running state cannot be reordered.
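
For example, to swap the queue positions of jobs 123 and 124 (identifiers illustrative):

qorder 123 124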

2.9 qrerun - rerun a pbs batch job

qrerun job_identifier ...

The qrerun command directs that the specified jobs are to be rerun if possible. To rerun a job is to terminate the session leader of the job and return the job to the queued state in the execution queue in which the job currently resides. If a job is marked as not rerunable then the rerun request will fail for that job. See the -r option on the qsub and qalter commands.

2.10 qrls - release hold on pbs batch job

qrls [-h hold_list] job_identifier ...

The qrls command removes or releases holds which exist on batch jobs. A job may have one or more types of holds which make the job ineligible for execution. The types of holds are USER, OTHER, and SYSTEM. The different types of holds may require that the user issuing the qrls command have special privilege. Typically, the owner of the job will be able to remove a USER hold, but not an OTHER or SYSTEM hold. An attempt to release a hold for which the user does not have the correct privilege is an error and no holds will be released for that job. If no -h option is specified, the USER hold will be released. If the job has no execution_time pending, the job will change to the queued state. If an execution_time is still pending, the job will change to the waiting state.
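
For example, to release the USER hold set earlier with qhold (identifier illustrative):

qrls -h u 123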

2.11 qselect - list the job identifiers in accordance with selection criteria

qselect [-a [op]date_time] [-A account_string] [-c [op]interval] [-h hold_list] [-l resource_list] [-N name] [-p [op]priority] [-q destination] [-r rerun] [-s states] [-u user_list]

The qselect command provides a method to list the job identifier of those jobs which meet a list of selection criteria. Jobs are selected from those owned by a single server. When qselect successfully completes, it will have written to standard output a list of zero or more jobs which meet the criteria specified by the options. Each option acts as a filter restricting the number of jobs which might be listed. With no options, the qselect command will list all jobs at the server which the user is authorized to list (query status of). The -u option may be used to limit the selection to jobs owned by this user or other specified users.
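
Because qselect writes bare job identifiers to standard output, it combines naturally with the other commands. A hedged example (the user name tom is illustrative) that deletes all of that user's queued (state Q) jobs:

qdel `qselect -s Q -u tom`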

2.12 qstat - show status of pbs batch jobs

qstat [-f][-W site_specific] [job_identifier... | destination...]

qstat [-a|-i|-r] [-n] [-s] [-G|-M] [-R] [-u user_list] [job_identifier... | destination...]

qstat -Q [-f][-W site_specific] [destination...]

qstat -q [-G|-M] [destination...]

qstat -B [-f][-W site_specific] [server_name...]

The qstat command is used to request the status of jobs, queues, or a batch server. The requested status is written to standard out. When requesting job status, synopsis format 1 or 2, qstat will output information about each job_identifier or all jobs at each destination. Jobs for which the user does not have status privilege are not displayed. When requesting queue or server status, synopsis format 3 through 5, qstat will output information about each destination.
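
Some typical invocations (the job identifier is illustrative):

qstat -a        # all jobs at the default server, in the alternative wide format
qstat -f 123    # full display of all attributes of job 123
qstat -q        # summary of the queues at the default server
qstat -B        # status of the batch server itself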

2.13 qsub - submit a new pbs job

qsub [-a date_time] [-A account_string] [-c interval] [-C directive_prefix] [-e path] [-h] [-I] [-j join] [-k keep] [-l resource_list] [-m mail_options] [-M user_list] [-N name] [-o path] [-p priority] [-q destination] [-r c] [-S path_list] [-u user_list] [-v variable_list] [-V] [-W additional_attributes] [-z] [script]

To create a job is to submit an executable script to a batch server. The batch server will be the default server unless the -q option is specified. See discussion of PBS_DEFAULT under Environment Variables below. Typically, the script is a shell script which will be executed by a command shell such as sh or csh.

Options on the qsub command allow the specification of attributes which affect the behavior of the job.

The qsub command will pass certain environment variables in the Variable_List attribute of the job. These variables will be available to the job. The value for the following variables will be taken from the environment of the qsub command: HOME, LANG, LOGNAME, PATH, MAIL, SHELL, and TZ. These values will be assigned to a new name which is the current name prefixed with the string "PBS_O_". For example, the job will have access to an environment variable named PBS_O_HOME which has the value of the variable HOME in the qsub command environment. In addition to the above, the following environment variables will be available to the batch job.

PBS_O_HOST the name of the host upon which the qsub command is running.

PBS_O_QUEUE the name of the original queue to which the job was submitted.

PBS_O_WORKDIR the absolute path of the current working directory of the qsub command.

PBS_ENVIRONMENT set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job, see -I option.

PBS_JOBID the job identifier assigned to the job by the batch system.

PBS_JOBNAME the job name supplied by the user.

PBS_NODEFILE the name of the file containing the list of nodes assigned to the job (for parallel and cluster systems).

PBS_QUEUE the name of the queue from which the job is executed.
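
As a sketch of how these variables are typically used inside a job script (the job name and cput limit are illustrative):

#!/bin/sh
#PBS -N envdemo
#PBS -l cput=0:10:00
# run in the directory from which the job was submitted
cd $PBS_O_WORKDIR
echo "Job $PBS_JOBID ($PBS_JOBNAME) started in queue $PBS_QUEUE"
echo "Submitted from host $PBS_O_HOST"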

2.14 nqs2pbs - convert NQS job scripts to PBS

nqs2pbs nqs_script [pbs_script]

This utility converts an existing NQS job script to work with PBS and NQS. The existing script is copied and PBS directives, #PBS, are inserted prior to each NQS directive, #QSUB or #@$, in the original script. Certain NQS date specifications and options are not supported by PBS. A warning message will be displayed indicating the problem and the line of the script on which it occurred.

If any unrecognizable NQS directives are encountered, an error message is displayed. The new PBS script will be deleted if any errors occur.

3. Submitting MPI parallel jobs

The PBS batch system can be used to manage node allocation in a cluster of hosts. For example, using a particular job script, it is possible to communicate to the MPI launcher program (mpirun) the number and the list of nodes that PBS has allocated for the whole job as requested by the user. The PBS server will not run more jobs on the busy nodes until the end of the current job. Here is an example script that does this over a Myrinet network using an implementation of MPICH over GM (a proprietary protocol developed by Myricom); in such a script you normally only have to change the number of nodes required, the working directory, and the name of the executable MPI program.

#!/bin/sh

#! example of job file to submit parallel MPI applications

#! lines starting with #PBS are options for the qsub command

 

#! Number of nodes (in this case I require 4 nodes with 2 CPUs each)

#! The total number of nodes passed to mpirun will be nodes*ppn

#PBS -l nodes=4:ppn=2

 

#! Name of output files for std output and error;

#! if not specified, defaults are <job-name>.o<job-number> and <job-name>.e<job-number>

#PBS -e test.err

#PBS -o test.log

#! Mail to user when the job terminates or aborts

#PBS -m ae

#! change the working directory (default is home directory)

cd <working directory>

 

echo Running on host `hostname`

echo Time is `date`

echo Directory is `pwd`

echo This job runs on the following processors:

echo `cat $PBS_NODEFILE`

 

#! Counts the number of processors

NPROCS=`wc -l < $PBS_NODEFILE`

echo This job has allocated $NPROCS processors

 

#! Create a machine file for Myrinet

echo $NPROCS >$PBS_JOBID.nodefile

awk '{if ($0 in vett) print $0 " " 7; else print $0 " " 6 ; vett[$0]="x"}' $PBS_NODEFILE >>$PBS_JOBID.nodefile

 

#! Run the parallel MPI executable (change the default a.out)

/usr/local/mpi-myri/bin/mpirun.ch_gm --gm-v --gm-f $PBS_JOBID.nodefile --gm-kill 30 -np $NPROCS a.out

 

rm $PBS_JOBID.nodefile

A better solution is to substitute the standard MPI launcher (mpirun), which uses the rsh mechanism to run the application on the nodes, with a new launcher program using the task manager library of PBS to spawn copies of the executable on all the nodes. The goals of such a program are:

  • Integration of parallel jobs into PBS.
  • Proper accounting and enforcement of cpu time and resource use for parallel jobs (all processes are children of pbs_mom, of course there are ways around it).
  • More scalable, reliable parallel job startup for clusters.

One implementation of this scheme for the Myricom net is the program mpiexec, which integrates PBS with the MPICH implementation over GM. In this case the example script can be simplified:

#!/bin/sh

#! example of job file to submit with qsub

#! lines starting with #PBS are options for the qsub command

#! Number of nodes (4 nodes with 2 CPUs each in this case, 8 processors in total)

#PBS -l nodes=4:ppn=2

#! Name of output files for std output and error;

#! if not specified, defaults are <job-name>.o<job-number> and <job-name>.e<job-number>

#PBS -e test.err

#PBS -o test.log

#! Mail to user when the job terminates or aborts

#PBS -m ae

#! This job's working directory

echo Working directory is $PBS_O_WORKDIR

#!cd <working directory>

echo Running on host `hostname`

echo Time is `date`

echo Directory is `pwd`

echo This job runs on the following processors:

echo `cat $PBS_NODEFILE`

#! option to kill all the processes if one of them dies

export GMPIRUN_KILL=1      # or in csh: setenv GMPIRUN_KILL 1

export GMPIRUN_VERBOSE=1   # or in csh: setenv GMPIRUN_VERBOSE 1  

#! Run the parallel MPI executable - it's possible to redirect stdin/stdout of all processes

#! using "<" and ">" - including the double quotes

/usr/local/bin/mpiexec -bg a.out

4. System configuration

Without any specification, the installation phase will produce a working PBS system with the following defaults:

  • User commands are installed in /usr/local/bin.
  • The daemons and administrative commands are installed in /usr/local/sbin.
  • The working directory (PBS_HOME) for the daemons is /usr/spool/pbs.
  • The Scheduler will be the C based scheduler "fifo".

Once the system has been built and installed, the Server and Moms must be configured and the scheduling policy must be implemented. These items are closely coupled. Managing which and how many jobs are scheduled into execution can be done in several ways. Each method has an impact on the implementation of the scheduling policy and server attributes. An example is the decision to schedule jobs out of a single pool (queue) or divide jobs into one of multiple queues, each of which is managed differently. If you want to run jobs on more than a single computer, you will need to install the execution daemon (pbs_mom) on each host where jobs are expected to execute. If you are running the default scheduler, fifo, you will need to fill a nodes file (PBS_HOME/server_priv/nodes) with one entry for each execution host, specifying, if appropriate, the number of processors per host. For example:

node1 np=4
node2 np=4
node3 np=2
node4 np=2 

If you write your own Scheduler, it can learn which hosts jobs may run on in ways other than the Server's nodes file.

4.1 qmgr - pbs batch system manager command

qmgr [-a] [-c command] [-e] [-n] [-z] [server...]

The qmgr command provides an administrator interface to the batch system. The command reads directives from standard input. The syntax of each directive is checked and the appropriate request is sent to the batch server or servers. The list or print subcommands of qmgr can be executed by general users. Creating or deleting a queue requires PBS Manager privilege. Setting or unsetting server or queue attributes requires PBS Operator or Manager privilege. The server operands identify the name of the batch server to which the administrator requests are sent. Each server conforms to the following syntax: host_name[:port] where host_name is the network name of the host on which the server is running and port is the port number to which to connect. If server is not specified, the administrator requests are sent to the local server.

A qmgr directive is one of the following forms:

   command server [names] [attr OP value[,attr OP value,...]]
   command queue [names] [attr OP value[,attr OP value,...]]
   command node [names] [attr OP value[,attr OP value,...]]

Where command is the command to perform on an object. Commands are:

active

sets the active objects. If the active objects are specified and a name is not given in a qmgr command, the active object names will be used.

create

is to create a new object, applies to queues and nodes.

delete

is to destroy an existing object, applies to queues and nodes.

set

is to define or alter attribute values of the object.

unset

is to clear the value of attributes of the object. Note, this form does not accept an OP and value, only the attribute name.

list

is to list the current attributes and associated values of the object.

print

is to print all the queue and server attributes in a format that will be usable as input to the qmgr command.

names is a list of one or more names of specific objects. The name list is in the form: [name][@server][,queue_name[@server]...] with no intervening white space. The name of an object is declared when the object is first created. If the name is @server, then all the objects of the specified type at the server will be affected.

attr specifies the name of an attribute of the object which is to be set or modified. If the attribute is one which consist of a set of resources, then the attribute is specified in the form: attribute_name.resource_name

OP operation to be performed with the attribute and its value:

=

set the value of the attribute. If the attribute has an existing value, the current value is replaced with the new value.

+=

increase the current value of the attribute by the amount in the new value.

-=

decrease the current value of the attribute by the amount in the new value.

value the value to assign to an attribute. If the value includes white space, commas or other special characters, such as the # character, the value string must be enclosed in quote marks (").
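
Putting these elements together, a minimal interactive qmgr session creating and activating a queue might look like the following sketch (the queue name dque is illustrative):

qmgr
Qmgr: create queue dque queue_type=execution
Qmgr: set queue dque enabled=true, started=true
Qmgr: set server default_queue=dque
Qmgr: list server
Qmgr: quit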

4.2 Starting the daemons

All three of the daemon processes, Server, Scheduler and Mom, must run with the real and effective uid of root. Typically, the daemons are started from the system's boot files, e.g. /etc/rc.local. However, it is recommended that the Server be brought up "by hand" the first time and configured before being run at boot time.

Starting mom

Mom should be started at boot time. Typically there are no required options. It works best if Mom is started before the Server on every node so they will be ready to respond to the Server's "are you there?" ping. Start Mom with the line:

{sbindir}/pbs_mom [options]

in the /etc/rc2 or equivalent boot file. If the Server or Scheduler are running on a different host, the host name(s) must be specified in Mom's configuration file; see the pbs_mom configuration section.

Starting the server

The initial run of the Server or any first time run after recreating the home directory must be with the -t create option:

{sbindir}/pbs_server -t create

This option directs the Server to discard any existing configuration files, queues and jobs, and initialize configuration files to the default values. This is best done by hand. At this point it is necessary to configure the Server. See the pbs_server configuration section.

After the Server is configured it may be placed into service. Normally it is started in the system boot file via a line such as:

{sbindir}/pbs_server [options]

The -t start_type option may be specified where start_type is one of the options (hot|warm|cold) specified in the pbs_server man page. The default is warm.

Starting the scheduler

The Scheduler should also be started at boot time. Start it with an entry in the /etc/rc2 or equivalent file:

{sbindir}/pbs_sched [options]

There are no required options for the default fifo scheduler.

4.3 Configuring the Execution Server, pbs_mom

The function of pbs_mom is to place jobs into execution as directed by the server, establish resource usage limits, monitor the job's usage, and notify the server when the job completes. If they exist, pbs_mom will execute a prologue script before executing a job and an epilogue script after executing the job. The next function of pbs_mom is to provide information about the status of running jobs, available memory, etc. in response to a resource monitor request, typically submitted by the PBS scheduler. Pbs_mom will record a diagnostic message in a log file for any error occurrence. The log files are maintained in the mom_logs directory below the home directory of the server (default /usr/spool/PBS/mom_logs). If the log file cannot be opened, the diagnostic message is written to the system console.

Mom must know the name of the server that manages it: it must be declared in the file PBS_HOME/server_name. Mom's configuration is achieved via a configuration file which is read at initialization time and when Mom receives a SIGHUP signal. This file is described in the pbs_mom(8) man page as well as in the following section. If the -c option is not specified when Mom is run, she will open PBS_HOME/mom_priv/config if it exists. If it does not, Mom will continue anyway. The configuration file must be "secure": it must be owned by a user id and group id less than 10 and not be world writable.

The file provides several types of run time information to pbs_mom: static resource names and values, external resources provided by a program to be run on request via a shell escape, and values to pass to internal set up functions at initialization (and re-initialization).

Each item type is on a single line with the component parts separated by white space. If the line starts with a hash mark (pound sign, #), the line is considered to be a comment and is skipped. An example configuration file is:

$logevent 0x0ff             #enables logging of all events except debug events
$clienthost fe.widget.com   #Mom will accept privileged connections from this host,
                            #typically the host where server and scheduler run
$restricted *.widget.com    #Mom will accept connections from these hosts,
                            #typically hosts on which a monitoring tool
                            #(such as xpbsmon) can be run
$ideal_load 2.0             #when the load average on the node drops below this value
                            #Mom informs the server that the node is no longer busy
$max_load   3.5             #when the load average on the node exceeds this value
                            #Mom informs the server that the node is busy
$cputmult 1.3               #factor used to adjust the cpu time usage of the job to allow
                            #comparison between nodes of different cpu performance
$wallmult 1.3               #factor used to adjust the wall time usage of the job to allow
                            #comparison between nodes of different cpu performance
$usecp bevyboss.widget.com:/u/home /r/home   #informs Mom to use cp instead of rcp or scp
                            #to transfer files from/to that destination because it is NFS mounted
tape8mm 2                   #informs Mom about the value of a static resource
                            #(here, the number of 8mm tape drives)

The directories and files involved are:

$PBS_SERVER_HOME/mom_priv the default directory for configuration files, typically (/usr/spool/PBS)/mom_priv.

$PBS_SERVER_HOME/mom_logs directory for log files recorded by Mom.

$PBS_SERVER_HOME/mom_priv/config the default configuration file

$PBS_SERVER_HOME/mom_priv/prologue the administrative script to be run before job execution.

$PBS_SERVER_HOME/mom_priv/epilogue the administrative script to be run after job execution.

4.4 Configuring the Job Server, pbs_server

Server management consists of configuring the Server attributes and establishing queues and their attributes. Unlike Mom and the Job Scheduler, the Job Server (pbs_server) is configured while it is running, except for the nodes file. Configuring server and queue attributes and creating queues is done with the qmgr command. This must be done either as root or as a user who has been granted PBS Manager privilege. Exactly what needs to be set depends on your scheduling policy and how you choose to implement it. The system needs at least one queue established and certain server attributes initialized.

The following are the "minimum required" server attributes and the recommended attributes; see the pbs_server_attributes man page for a complete list of server attributes. They are set via the set server (s s) subcommand to the qmgr command.

default_queue Declares the default queue to which jobs are submitted if a queue is not specified on the qsub command. The queue must be created first. Example:
   Qmgr: c q dque queue_type=execution
   Qmgr: s s default_queue=dque

acl_hosts A list of hosts from which jobs may be submitted. Example:
   Qmgr: s s acl_hosts=*.foo.bar.com,boss.hq.bar.com

acl_host_enable Enables the Server's host access control list, see above. Example:
   Qmgr: s s acl_host_enable=true

default_node Defines the node on which jobs are run if not otherwise directed. Example:
   Qmgr: s s default_node=big

managers Defines which users, at a specified host, are granted batch system administrator privilege. Example:
   Qmgr: s s managers=me@*.foo.bar.com,sam@big.foo.bar.com

node_pack Defines the order in which multiple-cpu cluster nodes are allocated to jobs.

resources_default This attribute establishes the resource limits assigned to jobs that were submitted without a limit and for which there are no queue limits. See the pbs_resources_* man page for your system type (* is irix6, linux, solaris5, ...). Example:
   Qmgr: s s resources_default.cput=5:00
   Qmgr: s s resources_default.mem=4mb

resources_max This attribute sets the maximum amount of resources which can be used by a job entering any queue on the Server. This limit is checked only if there is not a queue specific resources_max attribute defined for the specific resource.

Queues Configuration

There are two types of queues defined by PBS, routing and execution. A routing queue is a queue used to move jobs to other queues which may even exist on different PBS Servers. Routing queues are similar to the old NQS pipe queues. A job must reside in an execution queue to be eligible to run. The job remains in the execution queue during the time it is running.

A Server may have multiple queues of either or both types. A Server must have at least one queue defined. Typically it will be an execution queue; jobs cannot be executed while residing in a routing queue.

Queue attributes fall into three groups: those which are applicable to both types of queues, those applicable only to execution queues, and those applicable only to routing queues. If an "execution queue only" attribute is set for a routing queue, or vice versa, it is simply ignored by the system. However, as this situation might indicate the administrator made a mistake, the Server will issue a warning message about the conflict. The same message will be issued if the queue type is changed and there are attributes that do not apply to the new type.

Not all of the Queue Attributes are discussed here, only what is needed to get a reasonable system up and running. See the pbs_queue_attributes man page for a complete list of queue attributes.

queue_type Must be set to either execution or routing (e or r will do). The queue type must be set before the queue can be enabled. Example: Qmgr: s q dque queue_type=execution

enabled If set to true, jobs may be enqueued into the queue. If false, jobs will not be accepted.

started If set to true, jobs in the queue will be processed: routed by the Server if the queue is a routing queue, or scheduled for execution if it is an execution queue.

route_destinations (Only for routing queues) List the local queues or queues at other Servers to which jobs in this routing queue may be sent. For example: Qmgr: s q routem route_destinations=dque,overthere@another.foo.bar.com

resources_max If you choose to have more than one execution queue based on the size or type of job, you may wish to establish maximum and minimum values for various resource limits. This will restrict which jobs may enter the queue and will override a resources_max defined for the same resource at the Server level. If there is no maximum value declared for a resource type, there is no restriction on that resource. For example: s q dque resources_max.cput=2:00:00 places a restriction that no job requesting more than 2 hours of cpu time will be allowed in the queue. There is no restriction on the memory, mem, limit a job may request.

resources_min Defines the minimum value of resource limit specified by a job before the job will be accepted into the queue. If not set, there is no minimum restriction.

resources_default Defines a set of default values for jobs entering the queue that did not specify certain resource limits. There is a corresponding server attribute which sets a default for all jobs.
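
For instance, a hedged sketch of an execution queue restricted to short jobs (the queue name shortq and its limits are illustrative):

Qmgr: create queue shortq queue_type=execution
Qmgr: set queue shortq resources_max.cput=0:20:00
Qmgr: set queue shortq resources_min.cput=0:00:01
Qmgr: set queue shortq resources_default.cput=0:10:00
Qmgr: set queue shortq enabled=true, started=true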

The limit for a specific resource usage is established by checking various job, queue, and server attributes. The following list shows the attributes and their order of precedence:

1. The job attribute Resource_List, i.e. what was requested by the user.
2. The queue attribute resources_default.
3. The Server attribute resources_default.
4. The queue attribute resources_max.
5. The Server attribute resources_max.

Please note, an unset resource limit for a job is treated as an infinite limit.

Recording Server Configuration

Should you wish to record the configuration of a Server for re-use, you may use the print subcommand of qmgr. For example,

qmgr -c "print server" > /tmp/server.con

will record in the file server.con the qmgr subcommands required to recreate the current configuration including the queues. The commands can be fed back into qmgr via standard input:

qmgr < /tmp/server.con

It isn't necessary to do this at every pbs_server startup because (unless -t create is specified) the Server maintains its current configuration in a private database (server_priv/serverdb).

4.5 Configuring the Scheduler, pbs_sched

PBS provides a separate process to schedule which jobs should be placed into execution. This is a flexible mechanism by which you may implement a very wide variety of policies. In fact it is possible to implement a replacement Scheduler using the provided APIs which will enforce the desired policies. The configuration required for a Scheduler depends on the Scheduler itself. The delivered FIFO Scheduler provides the ability to sort the jobs in several different ways, in addition to FIFO order. There is also the ability to sort on user and group priority. Mainly, this Scheduler is intended to be a jumping-off point for a real Scheduler to be written. A good amount of code has been written to make it easier to change and add to this Scheduler. As distributed, the fifo Scheduler is configured with the following options (see file PBS_HOME/sched_priv/sched_config):

  • All jobs in a queue will be considered for execution before the next queue is examined.
  • The queues are sorted by queue priority.
  • The jobs within each queue are sorted by requested cpu time (cput). The shortest job is placed first.
  • Jobs which have been queued for more than a day will be considered starving and heroic measures will be taken to attempt to run them.
  • Any queue whose name starts with "ded" is treated as a dedicated time queue. Jobs in that queue will only be considered for execution if the system is in dedicated time as specified in the dedicated_time configuration file. If the system is in dedicated time, jobs not in a "ded" queue will not be considered. (See file PBS_HOME/sched_priv/dedicated_time)
  • Prime time is from 4:00 AM to 5:30 PM. Any holiday is considered non-prime. Standard federal holidays for the year 1998 are included. (See file PBS_HOME/sched_priv/holidays)
  • A sample dedicated_time and resource group file are also included.
  • These system resources are checked to make sure they are not exceeded: mem (memory requested) and ncpus (number of CPUs requested).

Change directory into PBS_HOME/sched_priv and edit the scheduling policy config file sched_config, or use the default values. This file controls the scheduling policy (which jobs are run when). The format of the sched_config file is:

name: value [prime | non_prime | all]

name and value may not contain any white space. value can be: true | false | number | string. Any line starting with a '#' is a comment. A blank third word is equivalent to "all", which is both prime and non-prime. The associated values shipped as defaults are shown in braces {}. Here are some of the scheduler attributes you can set:

round_robin {false all} boolean: If true - run jobs one from each queue in a circular fashion; if false - run as many jobs as possible up to queue/server limits from one queue before processing the next queue. The following server and queue attributes, if set, will control if a job "can be" run: resources_max, max_running, max_user_run, and max_group_run. See the man pages pbs_server_attributes and pbs_queue_attributes.

by_queue {true all} boolean: If true - the jobs will be run from their queues; if false - the entire job pool in the Server is looked at as one large queue.

strict_fifo {false all} boolean: If true - will run jobs in a strict FIFO order. This means that if a job fails to run for any reason, no more jobs will run from that queue/server during that scheduling cycle. If strict_fifo is not set, large jobs can be starved, i.e., not allowed to run because a never-ending series of small jobs uses the available resources. Also see the server attribute resources_max and the fifo parameter help_starving_jobs below.

fair_share {false all} boolean: This will turn on the fair share algorithm. It will also turn on usage collecting, and jobs will be selected using a function of their usage and priority (shares).

load_balancing {false all} boolean: If this is set the Scheduler will load balance the jobs between a list of time-shared hosts (:ts) obtained from the Server (pbs_server). The Server reads the list from its nodes file.

help_starving_jobs boolean: This bit will have the Scheduler turn on its rudimentary starving jobs support. Once jobs have waited for the amount of time given by max_starve, they are considered starving, i.e. no jobs will run until the starving job can be run. max_starve needs to be set also.

max_starve The amount of time before a job is considered starving. This config variable is not used if help_starving_jobs is not set.

sort_by {shortest_job_first} string: selects how jobs are sorted. sort_by can be set to a single sort type or to multi_sort. If set to multi_sort, multiple key fields are used. Each key field will be a key for the multi sort. The order of the key fields decides which sort type is used first. Possible sort keys: no_sort, shortest_job_first, longest_job_first, smallest_memory_first, largest_memory_first, high_priority_first, low_priority_first, multi_sort, fair_share, large_walltime_first, short_walltime_first.

log_filter {256} What event types not to log. The value should be the addition of the event classes which should be filtered (i.e. ORing them together). The numbers are defined in src/include/log.h. NOTE: those numbers are in hex and log_filter is in base 10.

dedicated_prefix {ded} The queues with this prefix will be considered dedicated queues.

Example of FIFO Configuration file

#Set the boolean values which define how the scheduling policy finds
#the next job to consider to run.
round_robin: False      ALL
by_queue: True          prime
by_queue: false         non-prime
strict_fifo: true       ALL
fair_share: True        prime
fair_share: false       non-prime
 
# help jobs which have been waiting too long
help_starving_jobs: true        prime
help_starving_jobs: false       non-prime
 
# Set a multi_sort
# This example will sort jobs first by ascending cpu time requested, and then
# by ascending memory requested, and then finally by descending job priority
#
sort_by: multi_sort
key: shortest_job_first
key: smallest_memory_first
key: high_priority_first
 
# Set the debug level to only show high level messages.
# Currently this only shows jobs being run
debug_level: high_mess
 
# a job is considered starving if it has waited for this long
max_starve:     24:00:00
 
# If the Scheduler comes across a user which is not currently in the resource
# group tree, they get added to the "unknown" group. The "unknown" group is in
# root's resource group. This says how many shares it gets.
unknown_shares: 10
 
# The usage information needs to be written to disk in case the Scheduler
# goes down for any reason. This is the amount of time between when the
# usage information in memory is written to disk. The example syncs the
# information every hour.
sync_time: 1:00:00
 
# What events do you not want to log. The event numbers are defined in
# src/include/log.h. NOTE: the numbers are in hex, and log_filter is in
# base 10 (256 is the shipped default).
log_filter: 256