Torque is a resource manager that controls job distribution across several computers (a cluster).

Here are some tips for installing Torque on two or more computers.

Installation (server and nodes)

First, follow this thread. It has all the steps necessary to get started.

http://ubuntuforums.org/showthread.php?t=1372508

./configure             # run from the unpacked torque source directory
make
sudo make install
./torque.setup root     # initialize the server database, with root as the admin user
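
To confirm the build landed where expected (a source install defaults to /usr/local), a quick check:

which pbs_server pbs_mom pbs_sched qsub   # should all resolve under /usr/local/bin and /usr/local/sbin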

Do not forget to set up the library path correctly, or you will get errors like this:
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: MaxConnectTimeout

echo /usr/local/lib | sudo tee -a /etc/ld.so.conf   # plain "sudo echo ... >>" fails: the redirection runs in your unprivileged shell
sudo ldconfig
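
To confirm the linker now finds the library (assuming a default source install, which puts libtorque under /usr/local/lib):

ldconfig -p | grep -i torque   # should list libtorque.so from /usr/local/lib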

Server configuration

You have to configure /var/spool/torque/server_priv/nodes by adding, for example:

my_server np=4
node01 np=4

Configure /var/spool/torque/mom_priv/config so that the server can also act as a node:

$pbsserver my_server

Configure /var/spool/torque/server_name:

my_server

Configuring the server with these commands is crucial:

# block 1

qmgr -c "set server scheduling=true"
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch started=true"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch resources_default.nodes=4"
qmgr -c "set queue batch resources_default.walltime=3600"
qmgr -c "set server default_queue=batch"
qmgr -c "set server operators = me@computer.com"
qmgr -c "set server keep_completed = 0"

# block 2: before I did this, my jobs would get stuck in the queue. Adjust these to your needs; I could not find much information on what each option does exactly.

qmgr -c "set queue batch max_running = 8"
qmgr -c "set queue batch resources_max.ncpus = 8"
qmgr -c "set queue batch resources_min.ncpus = 1"
qmgr -c "set queue batch resources_max.nodes = 2"
qmgr -c "set queue batch resources_default.ncpus = 1"
qmgr -c "set queue batch resources_default.neednodes = 1:ppn=1"
qmgr -c "set queue batch resources_default.nodect = 1"
qmgr -c "set queue batch resources_default.nodes = 1"

Kill all pbs processes and restart:

killall pbs_server pbs_mom pbs_sched
pbs_server; pbs_mom; pbs_sched
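
A minimal smoke test at this point, assuming the setup above (note that qsub rejects jobs from root by default, so submit as a regular user):

pbsnodes -a              # every node should report state = free
echo "hostname" | qsub   # trivial test job
qstat                    # the job should move Q -> R -> C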

If you have a firewall

  • Open UDP ports 15001, 15004, and 1023.
  • I also had to add special scp instructions or open port 22; otherwise my jobs would stay in E (exiting) state forever. (An iptables sketch follows this list.)
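
If your firewall is iptables, here is a minimal sketch of the corresponding rules, assuming a default-deny INPUT chain (the TCP lines are an assumption on my part; the torque daemons listen on TCP as well as UDP on these ports):

iptables -A INPUT -p tcp -m multiport --dports 15001,15004,1023 -j ACCEPT
iptables -A INPUT -p udp -m multiport --dports 15001,15004,1023 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT   # ssh/scp for copying job output back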

To instruct the node (mom) to use the correct ssh port and take advantage of NFS, add to /var/spool/torque/mom_priv/config:

$pbsserver my_server
$usecp *:/mnt/shared_disk /mnt/shared_disk
$rcpcmd /usr/bin/scp -P 2232
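
Because a broken copy-back silently leaves jobs in E state, it is worth confirming by hand, from a node, that the copy torque will attempt actually works (using the host, port, and path from the example above):

touch /tmp/testfile
scp -P 2232 /tmp/testfile my_server:/mnt/shared_disk/   # must complete without a password prompt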

Nodes

  • Install torque as above.
  • Do not forget to fix the library path.
  • Configure each node's mom_priv/config with the contents of the box above, adjusted to your needs.

Quick summary of errors I encountered

  1. Everything was set up properly, but jobs would stay in Q (queued) status or go directly to the C (completed) state without ever running. Torque would send emails to root saying “Job Deleted because it would never run” and “Not enough of the right type of nodes available”. The solution is to use block 2 of the “set queue” commands above.
  2. Jobs would run but stay in E (exit) state forever, clogging the queue. The solution is to open the correct ports in the firewall. Ports 15001, 15004, and 1023 on the server must be open to the nodes.
  3. Jobs would still stay in E (exit) state forever, even after the ports were opened. The solution is to either open port 22 or configure the node to use scp on an open port.
  4. pbs_mom was up on the node, but it appeared as “down” in pbsnodes. This was a firewall problem: for some reason, a new installation of torque required opening port 15096. Check /var/spool/torque/mom_logs/ for the error message (a grep sketch follows this list); torque was nice enough to say which port it was trying to reach on the server.
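
For item 4, a quick way to pull errors out of the mom log on a node (this assumes the default log location and torque's date-named log files, e.g. 20110223):

grep -i error /var/spool/torque/mom_logs/$(date +%Y%m%d)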

19 thoughts on “Installing Torque in Ubuntu 9.04”

  1. Hi,
    I configured a small cluster (1 server, 2 nodes) on Ubuntu using torque-3.0.0.
    qnodes -a shows all nodes in the free state, but the job is always stuck in the queue.

    Inside the server log, it just shows this again and again…
    02/23/2011 21:55:31;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.8, loglevel = 0
    02/23/2011 22:00:31;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.8, loglevel = 0
    02/23/2011 22:05:31;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.8, loglevel = 0
    ……..

    Any idea about this problem?

    1. I was having this problem before configuring qmgr (block 2 in my post), but I assume you’ve done that already.
      Check the mom and sched logs too.
      Check whether your firewall is blocking the ports used by torque.
      Can you force the job to run to completion?

  2. Hi,
    Thank you so much for the reply. My job now reaches the running state and completes.

    But now, after submitting the script, it cannot create the .out file.
    I submitted testjob.job following this link:
    http://wiki.hpc.ufl.edu/index.php/TorqueHowto

    For qstat -f (no error message is shown), it shows:
    Output_Path = nike:/mnt/unicore/UNICORE_Servers/testjob.out (this folder is shared with nfs)

    But testjob.out does not exist at that path.
    Any idea about this problem?

    1. How did you solve the problem? Does it run only when you force it with qrun?

      This is how I submit my jobs:
      qsub -V script.sh -o stdout.txt -e stderr.txt

      Do you see the job running on a given node?
      What may be happening is that torque, for some reason, is not able to access your NFS.
      I didn’t have this kind of problem; my output is saved to my NFS disk.
      Another possibility: did you configure a “scratch” location where each node writes data and then moves it to the NFS? Maybe torque is not even able to generate output?

  3. Hi,
    I didn’t use “qrun” to submit job.
    Actually before I didn’t restart the mom_sched and my job is stuck in queue.(I find this in net…if the job stuck in queue then restart the pbs_sched…..it works for me.)

    To submit a job I use: qsub testjob.job / submit.pbs
    I submit testjob.job following this link:
    http://wiki.hpc.ufl.edu/index.php/TorqueHowto

    1. My output file is created inside the /home/unicore folder (which contains my torque packages), but it is not created inside /mnt/unicore (which I specified in the config file and which is shared). How can I fix this?

    2. How can I check that the job is running on a given node? qstat is not working on the node.

    I don’t understand what you mean by “Another possibility”; could you please explain a little? I am actually new to Linux; I have been programming on Linux for the last month.

    1. qstat only works at the server.
      Log in to the node and check with top or ps whether your job is running.

      I was talking about your /home/unicore directory. Torque first outputs stuff to the node where it’s running the job, and then moves it to the final location.

      Did you configure torque to use scp with your proper config?
      Is your firewall open for torque?

  4. Hi,
    Thank you so much for your reply.

    1. My config file looks like this:
    $pbsserver nike.sookmyung.ac.kr
    $usecp nike.sookmyung.ac.kr:/mnt/unicore /mnt/unicore (mounted folder)
    $logevent 255
    … (is it correct?)

    2. I don’t have any firewall. When I grep for listening ports, the server listens on 15001, 15002, 15003, 15004, and 22, and the node on 15002, 15003, and 22. Is this enough?

    3. I can run an mpi program from the server across the cluster, and the nodes work. But when I use qsub to submit the mpi program with -l nodes=2, the nodes do not work; the program completes and releases the queue. I also tried submitting the way you do, with the same result.

    4. My ssh works from the server to all nodes without a password. Does ssh need to work among all nodes, and back to the server, without a password?

    1. Try changing usecp to *:/mnt/unicore
      I actually don’t know exactly how this config works, but maybe you’re telling torque that only the server should copy to /mnt/unicore.

      But do you see the program running on the cluster or not? You say it completed, but did it run? If it did, it is clearly a problem of copying from the node to the NFS.

  5. Hi,
    I found one thing: my job runs, shows status R, then C, and then releases the queue when the head node’s name is present in the server_priv/nodes file. If I delete the head node’s name, the job gets stuck in the queue and the server_log shows “PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request, bad attempt to connect from 203.252.195.148:1023 (address not trusted – check entry in server_priv/nodes)”

    And inside the sched_log: “pbs_sched;Svr;restart;restart on signal 1
    03/03/2011 16:26:04;0002; pbs_sched;Svr;die;caught signal 15”

    I tried every solution I could find on the net, but I still get the same error.

    1. Do I need NIS? Should ssh work among all nodes without a password? My ssh works from the server to the nodes without a password, i.e. one way only.

  6. Why don’t you set up passwordless ssh from the nodes to the server and see if it works?

    Notice that output to stdout is captured by torque. You can save it by using the -o option.

  7. Hi,
    I have been looking into this for the last couple of days, but I could not find on the net how to enable passwordless ssh in both directions; I only found one-way setups.

  8. Hi,
    Which torque version do you use for Ubuntu 9.04? (Do you use torque-2.4.3?)
    Does Ubuntu 10.10 support torque-3.0.0?

  9. Hi,
    Thank you so much for your reply.

    On Ubuntu 10.10 I did not open ports for torque, because after configuring torque the ports were open automatically. I wonder, on Ubuntu 9.04, do I need to open the ports myself using iptables or something like that?

  10. Hey, I’m having issues with torque.
    pbs_mom is running and the nodes are free, but they still show as “down”. Do the nodes have to connect via ssh with no password? Please, I need help; I am happy to give any further information. Thank you very much.

    1. I haven’t used torque in a long time, but I have password-less ssh (I use RSA keys) between my computers. I had to configure how scp copies files, so I think it’s safe to say you need to set up password-less ssh.
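
      For reference, a minimal sketch of that password-less setup (nodeuser and node01 are hypothetical placeholders):

      ssh-keygen -t rsa              # accept the defaults and leave the passphrase empty
      ssh-copy-id nodeuser@node01    # repeat for every machine you need to reach
      ssh nodeuser@node01 hostname   # should run without a password prompt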

 
