torque pbs 4.0.1 jobs stuck in queued ('Q') state; the scheduler does not seem to receive any notification

I am using Torque 4.0.1 on openSUSE 12.1 in a cluster environment. When I qsub a job (as simple as "echo hello"), it stays in the "Q" state and is never scheduled. I can force the job to run with qrun, and it then executes on the first node without errors.
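For concreteness, the reproduction is roughly this (job id 16 is from my session; yours will differ):

```shell
# Submit a trivial job to the default queue
echo "echo hello" | qsub

# The job stays in state 'Q' indefinitely
qstat

# Forcing it as a manager works, and the job runs on the first node
qrun 16.head
```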

I have been trying to find a solution for the last few days, with no luck. I have read the manual, the logs, even the source code, but I still cannot pinpoint the problem. Naturally I searched a lot and tried various fixes, but none of them worked.

Here is some information that may be useful:

  • pbs_sched is running, but its logs seem to suggest that it never receives any notification about queued jobs:
 05/13/2012 18:55:08;0002; pbs_sched;Svr;Log;Log opened
 05/13/2012 18:55:08;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120513 opened
 05/13/2012 18:55:08;0002; pbs_sched;Svr;main;pbs_sched startup pid 32604
  • The pbs_server log showed that the job was enqueued into the default batch queue:
 05/13/2012 19:33:08;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.1, loglevel = 0
 05/13/2012 19:33:56;0100;PBS_Server;Job;16.head;enqueuing into batch, state 1 hop 1
 05/13/2012 19:33:56;0008;PBS_Server;Job;16.head;Job Queued at request of pubuser@head, owner = pubuser@head, job name = STDIN, queue = batch
  • qstat -f 16 showed nothing useful:
 Job Id: 16.head
     Job_Name = STDIN
     Job_Owner = pubuser@head
     job_state = Q
     queue = batch
     server = head
     Checkpoint = u
     ctime = Sun May 13 19:33:56 2012
     Error_Path = head:/fserver/home/pubuser/STDIN.e16
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = a
     mtime = Sun May 13 19:33:56 2012
     Output_Path = head:/fserver/home/pubuser/STDIN.o16
     Priority = 0
     qtime = Sun May 13 19:33:56 2012
     Rerunable = True
     Resource_List.walltime = 01:00:00
     substate = 10
     Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/,
         PBS_O_WORKDIR=/fserver/home/pubuser,PBS_O_HOST=head,PBS_O_SERVER=head,
         PBS_O_WORKDIR=/fserver/home/pubuser
     euser = pubuser
     egroup = users
     queue_rank = 4
     queue_type = E
     etime = Sun May 13 19:33:56 2012
     fault_tolerant = False
     job_radix = 0
     submit_host = head
     init_work_dir = /fserver/home/pubuser
  • All nodes are free:
 sun1
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910403,varattr=,jobs=,state=free,netload=44492032184,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1697420kb,totmem=1802616kb,idletime=241085,nusers=0,nsessions=0,uname=Linux sun1 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0

 sun2
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910408,varattr=,jobs=,state=free,netload=39762812881,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1701012kb,totmem=1802616kb,idletime=239982,nusers=0,nsessions=0,uname=Linux sun2 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0

 sun3
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910400,varattr=,jobs=,state=free,netload=45984311925,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1699772kb,totmem=1802616kb,idletime=212303,nusers=0,nsessions=0,uname=Linux sun3 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0

 sun4
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910407,varattr=,jobs=,state=free,netload=37538584401,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805480kb,totmem=1908308kb,idletime=211197,nusers=0,nsessions=0,uname=Linux sun4 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0

 sun5
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910411,varattr=,jobs=,state=free,netload=173547166,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803816kb,totmem=1908308kb,idletime=211199,nusers=0,nsessions=0,uname=Linux sun5 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0

 sun6
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910411,varattr=,jobs=,state=free,netload=24641446,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805704kb,totmem=1908308kb,idletime=212999,nusers=0,nsessions=0,uname=Linux sun6 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0

 sun7
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910412,varattr=,jobs=,state=free,netload=1548383055,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805432kb,totmem=1908308kb,idletime=215630,nusers=0,nsessions=0,uname=Linux sun7 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0

 sun8
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910400,varattr=,jobs=,state=free,netload=128755968,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803448kb,totmem=1908308kb,idletime=211866,nusers=0,nsessions=0,uname=Linux sun8 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0

 sun9
      state = free
      np = 2
      ntype = cluster
      status = rectime=1336910374,varattr=,jobs=,state=free,netload=1371896399,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805664kb,totmem=1908308kb,idletime=211161,nusers=0,nsessions=0,uname=Linux sun9 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
      gpus = 0
  • qmgr -c 'p s':
 #
 # Create queues and set their attributes.
 #
 #
 # Create and define queue batch
 #
 create queue batch
 set queue batch queue_type = Execution
 set queue batch resources_default.walltime = 01:00:00
 set queue batch enabled = True
 set queue batch started = True
 #
 # Set server attributes.
 #
 set server scheduling = True
 set server acl_hosts = head
 set server managers = pubuser@head
 set server managers += root@head
 set server operators = pubuser@head
 set server operators += root@head
 set server default_queue = batch
 set server log_events = 511
 set server mail_from = adm
 set server scheduler_iteration = 600
 set server node_check_rate = 150
 set server tcp_timeout = 300
 set server job_stat_rate = 45
 set server poll_jobs = True
 set server mom_job_sync = True
 set server keep_completed = 0
 set server submit_hosts = head
 set server next_job_number = 17
 set server moab_arrays_compatible = True
  • momctl -d 13 on the first node:
 Host: sun1/sun1   Version: 4.0.1   PID: 5362
 Server[0]: head (192.168.0.1:15001)
   Last Msg From Server:   1584 seconds (DeleteJob)
   Last Msg To Server:     7 seconds
 HomeDirectory:          /var/spool/torque/mom_priv
 stdout/stderr spool directory: '/var/spool/torque/spool/' (4457492 blocks available)
 MOM active:             229485 seconds
 Check Poll Time:        45 seconds
 Server Update Interval: 45 seconds
 LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
 Communication Model:    TCP
 MemLocked:              TRUE (mlock)
 TCP Timeout:            0 seconds
 Trusted Client List:  127.0.0.1:0,192.168.0.1:0,192.168.0.101:0,192.168.0.101:15003,192.168.0.102:15003,192.168.0.103:15003,192.168.0.104:15003,192.168.0.105:15003,192.168.0.106:15003,192.168.0.107:15003,192.168.0.108:15003,192.168.0.109:15003: 0
 Copy Command:   /usr/bin/scp -rpB
 NOTE:  no local jobs detected

 diagnostics complete

The one thing that stands out is that TCP Timeout is 0 seconds, which does not seem normal. While diagnosing this, I found the following entry in mom_logs:

 05/13/2012 20:30:10;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Resource temporarily unavailable (11) in tcp_read_proto_version, no protocol version number End of File (errno 2) 

I searched for this error as well, but found nothing.
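In case it matters, this is roughly how I checked the scheduler side (the log path and the default pbs_sched port 15004 are assumptions based on a stock install; adjust for your layout):

```shell
# Restart the C scheduler
killall pbs_sched
pbs_sched

# It should be listening on its default port (15004)
netstat -lnt | grep 15004

# Watch today's scheduler log while submitting a job:
# nothing new appears when a job gets queued
tail -f /var/spool/torque/sched_logs/$(date +%Y%m%d)
```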

  • I compiled OpenMPI against this Torque 4.0.1 (for tm support) and I can run test programs without any problem.

I hope someone can help me figure this out. Thanks!
