PBS job scheduling application and description

Reposted from: http://blog.sciencenet.cn/blog-355217-464900.html

PBS (Portable Batch System) was originally developed at NASA's Ames Research Center. Its main purpose is to provide flexible batch processing for heterogeneous computing networks, and in particular for high-performance computing platforms such as clusters, supercomputers, and massively parallel systems. Its main features are: open source code that is freely available; and support for batch jobs, interactive jobs, and both serial and parallel jobs of several kinds, such as MPI, PVM, HPF, and MPL. PBS is one of the most full-featured, longest-established, and most widely supported local cluster schedulers. There are currently three main branches: OpenPBS, PBS Pro, and Torque. OpenPBS is the earliest PBS system and sees little further development. PBS Pro is the commercial version and has the richest feature set. Torque is the open-source version that Cluster Resources took over from OpenPBS and continues to support.

Using PBS differs from running a program directly, as in:  mpirun -np number ./executable_file

Run directly, the command above performs the parallel computation on a single node only. To run in parallel across multiple nodes, you need to write a machinefile or a p4pgfile. For how to write these two files, see "Introduction to Parallel Computing" by Zhang Linbo et al. The corresponding commands are:

       mpirun -machinefile filename

       mpirun -p4pg filename

Submitting tasks through PBS places them in a queue and runs them in order, which allocates resources effectively and avoids contention. Without it, the CPUs time-slice among everyone's tasks, which slows down everyone's work.

Torque PBS provides control over batch jobs and distributed computing nodes. The basic setup steps are:

  • Install the Torque components: install pbs_server on one node (the head node), install pbs_mom on all compute nodes, and install the PBS client on all compute nodes and submit nodes. Do at least the basic configuration needed to bring Torque up, that is, so that pbs_server knows which machines to talk to.
  • Create a job submission queue on pbs_server.
  • Assign a cluster name as a property on every node of the cluster. This can be done with the qmgr command, for example:

                  qmgr -c "set node <node> properties = <cluster-name>"

  • Verify that jobs can be submitted to the nodes. This can be done with the qsub command, for example:

                 echo "sleep 30" | qsub -l nodes=3
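The property-setting step above can be scripted when there are several nodes. The following is a minimal sketch rather than anything from the original notes: the function name print_node_property_cmds and the node and cluster names are hypothetical, and the function only prints the qmgr commands so that they can be reviewed before they are run on the head node.

```shell
#!/bin/sh
# Print (do not run) one qmgr command per node that tags the node with a
# cluster-name property. The function name and arguments are illustrative.
# Usage: print_node_property_cmds <cluster-name> <node> [<node> ...]
print_node_property_cmds() {
    cluster=$1
    shift
    for node in "$@"; do
        printf 'qmgr -c "set node %s properties = %s"\n' "$node" "$cluster"
    done
}
```

For example, print_node_property_cmds mycluster node01 node02 prints two commands; piping that output to sh would execute them, assuming qmgr is on the PATH and the caller has manager privileges.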

1.0 Torque (PBS) job submission system: personal installation notes
1.1 Torque installation (on the master management node)
1. Unzip the installation package

tar -zxvf torque-2.3.0.tar.gz

2. Enter the unpacked directory, then configure, build, and install

./configure --with-default-server=master

make

make install

3. Initialize the server and build the packages (<user> must be an ordinary, non-root user)

1) [root@master torque-2.3.0]# ./torque.setup <user>

2) [root@master torque-2.3.0]# make packages

Copy the generated packages torque-package-clients-linux-x86_64.sh and torque-package-mom-linux-x86_64.sh to all nodes.

3) Client installation

[root@master torque-2.3.0]# ./torque-package-clients-linux-x86_64.sh --install

[root@master torque-2.3.0]# ./torque-package-mom-linux-x86_64.sh --install

4) Edit /var/spool/torque/server_priv/nodes (create the file yourself if it does not exist) and add the following content

master   np=4

node01 np=4

........

node09 np=4
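A quick sanity check on the nodes file is to add up the np values, since that total is the number of cores the server believes it can schedule. The sketch below is an illustration under stated assumptions: the function name total_np is hypothetical, a line without np= is counted as one core, and lines starting with # are skipped as if they were comments.

```shell
#!/bin/sh
# Sum the np= values in a Torque nodes file.
# Assumptions: a host line without np= counts as 1 core; blank lines and
# lines starting with '#' are ignored.
# Usage: total_np <nodes-file>
total_np() {
    awk '
        /^[ \t]*(#|$)/ { next }
        {
            n = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) n = substr($i, 4) + 0
            total += n
        }
        END { print total + 0 }
    ' "$1"
}
```

Running total_np /var/spool/torque/server_priv/nodes on the master then prints a single number that can be compared with the core count you expect for the cluster.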

5) Start pbs_server, pbs_sched, and pbs_mom, and add them to /etc/rc.local so that they start automatically at boot.

6) Create a queue

[root@master ~]# qmgr

create queue students

set queue students queue_type = Execution

set queue students Priority = 40

set queue students resources_max.cput = 96:00:00

set queue students resources_min.cput = 00:00:01

set queue students resources_default.cput = 96:00:00

set queue students enabled = True

set queue students started = True
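The same queue definition can also be fed to qmgr non-interactively, since qmgr reads directives from standard input. The following sketch only generates the directive text; the function name queue_directives and its two parameters are hypothetical, and it covers a subset of the attributes set above.

```shell
#!/bin/sh
# Print qmgr directives that create and enable an execution queue.
# Usage: queue_directives <queue-name> <max-cput as HH:MM:SS>
queue_directives() {
    queue=$1
    max_cput=$2
    cat <<EOF
create queue $queue
set queue $queue queue_type = Execution
set queue $queue resources_max.cput = $max_cput
set queue $queue resources_default.cput = $max_cput
set queue $queue enabled = True
set queue $queue started = True
EOF
}
```

As root on the master, queue_directives students 96:00:00 | qmgr would then apply the directives in one step, which makes the queue setup easy to keep in a script.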

4. On each compute node node0x (x = 1-9)

[root@node0x torque-2.3.0]# ./torque-package-clients-linux-x86_64.sh --install

[root@node0x torque-2.3.0]# ./torque-package-mom-linux-x86_64.sh --install

Then start pbs_mom and add pbs_mom to /etc/rc.local

1.2 Using Torque PBS 
1. As root on the master, create a user

useradd test

passwd test

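Enter the password for test when prompted

Go to /var/yp and run make (this updates the NIS maps so that the new account is visible on the nodes)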

2. Configure ssh for ordinary users

su test

ssh-keygen -t dsa

cd .ssh

cat id_dsa.pub >> authorized_keys

chmod 600 authorized_keys
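The key installation can be wrapped in a small helper so that repeating it does not duplicate the key and the permissions always end up right. This is a sketch, not the author's procedure: the function name install_pubkey is hypothetical, and it assumes the public key file holds a single line, as files written by ssh-keygen do.

```shell
#!/bin/sh
# Append a public key to <ssh-dir>/authorized_keys unless the same line is
# already there, and set the permissions that sshd expects.
# Usage: install_pubkey <public-key-file> <ssh-dir>
install_pubkey() {
    pubkey=$1
    sshdir=$2
    mkdir -p "$sshdir" || return 1
    chmod 700 "$sshdir" || return 1
    # -x: match whole lines, -F: fixed strings, -f: read the pattern from the key file
    if ! grep -qxF -f "$pubkey" "$sshdir/authorized_keys" 2>/dev/null; then
        cat "$pubkey" >> "$sshdir/authorized_keys" || return 1
    fi
    chmod 600 "$sshdir/authorized_keys"
}
```

For the test user this would be called as install_pubkey ~/.ssh/id_dsa.pub ~/.ssh after ssh-keygen has been run.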

3. Write job script, see below

4. Start mpd

mpdboot -n 10 -f mfa

mfa content:

master:4

node01:4

….

node09:4
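The value passed to mpdboot -n is the number of hosts on which an mpd daemon is started, so it should match the number of lines in the hosts file. A minimal sketch for deriving it from the file instead of typing it by hand follows; the function name count_hosts is hypothetical.

```shell
#!/bin/sh
# Count the non-empty lines (one host per line) in an mpd hosts file.
# Usage: count_hosts <hosts-file>
count_hosts() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}
```

With this helper the start command could be written as mpdboot -n "$(count_hosts mfa)" -f mfa, which stays correct when hosts are added to or removed from mfa.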

5. Submit, query, delete jobs

Submit job: qsub pbsjob

[test1@master pbstest]$ qsub pbsjob

48.master    (the job ID, beginning with the job number, is printed after the job is submitted)

Query job: qstat

[test1@master pbstest]$ qstat

Delete job: qdel <job number>

[test1@master pbstest]$ qdel 48
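qsub prints a job ID of the form number.server, and scripts often need only the numeric part. The sketch below strips the server suffix with plain shell parameter expansion; the function name job_number is hypothetical.

```shell
#!/bin/sh
# Extract the numeric job number from a job ID such as "48.master".
# Usage: job_number <job-id>
job_number() {
    # ${1%%.*} removes the longest suffix that starts with a dot
    printf '%s\n' "${1%%.*}"
}
```

In a submit script this could be used as jobid=$(qsub pbsjob) followed by qdel "$(job_number "$jobid")" when the job has to be cancelled later.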

2.0 PBS service start operation process 
       I ran this procedure successfully on a Dawning cluster.

       1) Start the PBS server on the master node

             /etc/init.d/pbs_server start

       2) Start the PBS client (pbs_mom) on the master node and on the other nodes. Although the master node is the server, it can also take part in the computation, so it needs the client service too. Run the following on each node:

             /etc/init.d/pbs_mom   start

       3) Start the scheduler on all nodes

             /etc/init.d/maui.d     start

These PBS init scripts all accept the same set of arguments:

              status (view status)

              restart

              stop

              start

       4) The next step is to check whether the job can be submitted

              pbsnodes -a

A node whose state is reported as free can accept jobs.
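On a larger cluster it is easier to count the free nodes than to read the whole listing. The following sketch assumes that pbsnodes -a prints a line of the form "state = free" under each node name; the function name count_free_nodes is hypothetical.

```shell
#!/bin/sh
# Count nodes whose state is exactly "free" in pbsnodes -a output read from
# standard input. Assumed input layout (one block per node):
#   node01
#        state = free
#        np = 4
count_free_nodes() {
    awk '$1 == "state" && $2 == "=" && $3 == "free" { n++ } END { print n + 0 }'
}
```

It would be used as pbsnodes -a | count_free_nodes. A result of 0 suggests that pbs_mom is not running on the nodes or that the server cannot reach them.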

       5) Write the job script: vim pbs_fdtd_TE_xyPML_MPI_OpenMP

#!/bin/bash

# Number of nodes to use (nodes) and number of cores to use on each node (ppn)

#PBS -l nodes=5:ppn=4

# Task name (any name you like)

#PBS -N taskname

# Change to the working directory ($PBS_O_WORKDIR is an environment variable provided by PBS)

cd $PBS_O_WORKDIR

mpirun -np 20 ./fdtd_TE_xyPML_MPI_OpenMP

The hosts on which mpirun runs can be specified with the -machinefile or -p4pg option.
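Inside a running job, PBS provides the list of allocated hosts in the file named by $PBS_NODEFILE, and under Torque each host appears once per allocated core. The sketch below condenses such a list into host:count lines, the same layout as the mfa file shown earlier. The function name nodefile_to_machinefile is hypothetical, and whether mpirun accepts the host:count form depends on the MPI implementation in use.

```shell
#!/bin/sh
# Convert a node list with one line per allocated core into "host:count" lines.
# Usage: nodefile_to_machinefile <node-file>
nodefile_to_machinefile() {
    # sort groups identical hosts, uniq -c counts them, awk reorders the fields
    sort "$1" | uniq -c | awk '{ print $2 ":" $1 }'
}
```

In the job script this could be used as nodefile_to_machinefile "$PBS_NODEFILE" > machines, followed by mpirun -np 20 -machinefile machines ./fdtd_TE_xyPML_MPI_OpenMP.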

       6) Submit

              qsub pbs_fdtd_TE_xyPML_MPI_OpenMP

       7) Use qstat to check the job; the options are described below. That completes the process.

3.0 Common commands and options of PBS 
3.1 Basic script writing and options 
PBS is an abbreviation of Portable Batch System and is a task management system. When multiple users share the same computing resources, each user submits tasks with a PBS script, and PBS manages these tasks and allocates resources to them. Here is a simple PBS script:

#!/bin/bash

#PBS -l nodes=20

#PBS -N snaphu

#PBS -j oe

#PBS -l walltime=24:00:00

#PBS -l cput=1:00:00

#PBS -q dque

cd $PBS_O_WORKDIR

cat $PBS_NODEFILE > NODEFILE

mpirun -np 20 ./mpitest

Save this script as submit; running qsub submit then submits the mpitest task to the system. Lines beginning with #PBS are script options that set job parameters.

#PBS -l specifies a resource list, which sets the resources a particular task requires. Here nodes is the number of nodes the task may use in the parallel environment, walltime is the maximum wall-clock time for the task, and cput is the maximum CPU time. When the elapsed time or the CPU time exceeds its limit, the task is terminated with a timeout. These three parameters are not PBS script parameters as such, but resources required of the parallel environment.
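Time limits appear in this article both as HH:MM:SS (for example walltime=24:00:00) and as plain seconds (for example walltime=3600), so it helps to be able to convert between the two. The sketch below handles only the three-field HH:MM:SS form; the function name hms_to_seconds is hypothetical.

```shell
#!/bin/sh
# Convert a time limit written as HH:MM:SS into seconds.
# Usage: hms_to_seconds <HH:MM:SS>
hms_to_seconds() {
    # awk does the arithmetic, so leading zeros such as "08" are read as decimal
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}
```

For instance, hms_to_seconds 24:00:00 gives 86400 and hms_to_seconds 1:00:00 gives 3600, which shows that the script above allows far less CPU time than wall-clock time.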

#PBS -N indicates the task name.

#PBS -j controls how the job's output streams are joined: with oe, standard error (stderr) is merged into standard output (stdout); with eo, stdout is merged into stderr; if the option is not set, or is set to n, stderr and stdout are kept separate.

#PBS -q specifies the queue for the current task. A parallel environment often has several queues. After the task is submitted, it waits in the selected queue. You can use qstat -q to see which queues exist on the system.

      The PBS script file is composed of script options and running scripts.

      1) PBS job script options (unless the -C option is used, prefix each one with '#PBS')


      2) The run commands have the same format as in an ordinary Linux shell script, for example:

         mpirun -np <number of processes> ./<executable name>
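The two parts of a PBS script, the options and the run commands, can be generated from a few parameters so that the process count always agrees with the requested resources. This is a sketch under stated assumptions: the function name make_job_script and its four parameters are hypothetical, and it assumes one MPI process per requested core.

```shell
#!/bin/sh
# Print a PBS job script: options first, then the run commands.
# Usage: make_job_script <job-name> <nodes> <ppn> <executable>
make_job_script() {
    name=$1
    nodes=$2
    ppn=$3
    exe=$4
    nprocs=$((nodes * ppn))    # one MPI process per requested core
    cat <<EOF
#!/bin/bash
#PBS -N $name
#PBS -l nodes=$nodes:ppn=$ppn
#PBS -j oe
cd \$PBS_O_WORKDIR
mpirun -np $nprocs $exe
EOF
}
```

For example, make_job_script taskname 5 4 ./fdtd_TE_xyPML_MPI_OpenMP > job.pbs writes a script that requests 5 nodes with 4 cores each and starts 20 processes; it can then be submitted with qsub job.pbs.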

3.2 PBS commands and options 
PBS provides four commands for job management:

1. qsub command: used to submit job scripts

Command format:

qsub [-a date_time] [-e path] [-I] [-l resource_list] [-M user_list] [-N name] [-S path_list] [-u user_list] [-W additional_attributes]

Example: # qsub aaa.pbs    Submits a job; the system generates a job number

2. qstat command: used to query job status information

Command format:

qstat [-f][-a][-i] [-n][-s] [-R] [-Q][-q][-B][-u]

Parameter Description:

-f jobid    List full information for the specified job

-a    List all jobs in the system

-i    List jobs that are not running

-n    List the nodes assigned to each job

-s    List the comments provided by the queue manager and scheduler

-R    List disk reservation information

-Q    The operand is a destination id; queue status is requested

-q    List queue status in the alternative display format

-au userid    List all jobs of the specified user

-B    List PBS server information

-r    List all running jobs

-Qf queue    List full information for the specified queue

-u user_list    If the operand is a job number, list its status. If the operand is a destination id, list the status of the jobs there that belong to the users in user_list.

Example: # qstat -f 211 Query the specific information of the job with the job number 211.
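The default qstat listing can be filtered by job state, which appears in the second-to-last column (R for running, Q for queued). The sketch below assumes the usual two header lines followed by one line per job; the function name jobs_in_state is hypothetical.

```shell
#!/bin/sh
# Print the IDs of jobs in a given state from default qstat output on stdin.
# Assumed layout: two header lines, then one line per job with the columns
#   Job id, Name, User, Time Use, S, Queue
# Usage: qstat | jobs_in_state <state-letter>
jobs_in_state() {
    awk -v s="$1" 'NR > 2 && $(NF - 1) == s { print $1 }'
}
```

For example, qstat | jobs_in_state R lists the running jobs and qstat | jobs_in_state Q lists those still waiting in a queue.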

3. qdel command: used to delete submitted jobs

Command format:

qdel [-W delay] job_number

Example: # qdel -W 15 211    Deletes job 211, waiting 15 seconds between sending SIGTERM and SIGKILL

4. qmgr command: used for queue management

      qmgr -c "create queue batch queue_type=execution"

      qmgr -c "set queue batch started=true"

      qmgr -c "set queue batch enabled=true"

      qmgr -c "set queue batch resources_default.nodes=1"

      qmgr -c "set queue batch resources_default.walltime=3600"

      qmgr -c "set server default_queue=batch"



This article is contributed by Anonymous and text available under CC-SA-4.0

 

https://titanwolf.org/Network/Articles/Article?AID=edf50c25-eeaf-49bd-a52d-b49690b71d7c 
