EGEE JRA1 testing team

EGEE gLite knowledge database for WMS related problems

# | Problem (short) | Date | Details (cause) | Solution | Involved components | Debug procedure | Savannah bug #
1 | Condor: Cannot create new cluster failure | Feb 04 | WMS constraint | make sure QUEUE_SUPER_USERS in the condor config on the CE is set to the Condor user name on the WMS node | WMS, UI | see the globus log files; reinstalled the machine | -----
2 | Cannot locate condor schedd (Error locating schedd) | Nov 04 | gatekeeper on the CE not running, condor-c symlink missing, or an authentication/authorization failure | log in on the CE and start the gatekeeper again, or make sure authentication/authorization works; also make sure that the HOME directory for the Condor user (CONDOR_IDS) exists on the CE | WMS, CE | see the most relevant log files; check that basic globus functionality is there | 5525
3 | Jobs keep staying in Ready | Dec 04, Jan 05 | LRMS not working, or not properly configured | make sure the LRMS is working properly | WMS, CE | run a basic LRMS test such as qsub or bsub and retrieve the output | -----
4 | no local mappings for globus id | Oct 04 | no authorization on the resource | put the /etc/grid-security/grid-mapfile in place with your DN in it | CE | edit files | -----
5 | condor daemons not starting and no log files available | Jan 05 | most likely wrong permissions or ownership of the relevant condor directories | chown -R gproduct:gm /opt/condor-6.7.3 | WMS | see the condor standard log files | -----
6 | condor schedd doesn't start | Nov 04 | condor_config is not correct | edit the /opt/condor-6.7.3/etc/condor_config file and the local one and make sure CONDOR_IDS are correct | WMS | check the condor configuration files | -----
7 | unable to contact any Network Server (see also item #20) | Nov 04 | most likely the NS daemon is not running on the WMS node | log in on the WMS and restart the WMS daemon, or make sure you are in the gridmap file of the WMS | UI, WMS | check daemons status on the NS | 5325
8 | Job got an error while in the CondorG queue. | Nov 04 | similar to problem 2: normally it's because it did not find the schedd | make sure the gatekeeper is running on the CE and check the condor log files under /opt/condor-6.7.3/lxb1420.local/log | UI, CE | use the command glite-job-logging-info -v 2 JobId | 5525
9 | glite-job-list-match doesn't provide any resource for the job to be executed | Jan 05 | normally this is because something is wrong in the InfoSys (the fake BD-II) | log in on the fake BD-II and make sure things are properly set | BD-II, UI | use the update command for LCG BD-II and ldapsearch to check the BD-II content | 5115
10 | no reaction during the submission commands | Nov 04 | sometimes this is due to wrong addressing to the LB machine | check GLITE_LB_DESTINATION and GLITE_WMS_LOG_DESTINATION | UI, LB | edit the /opt/glite/etc/egtest/glite_wmsui.conf file and check or export the proper variables | -----
11 | failed logging to the LB daemon | Dec 04 | sometimes the bkserver on the LB is down | restart the LB server on the LB node with /opt/glite/etc/init.d/glite-bkserver start | UI, LB | check daemons status on the LB | -----
12 | odd and varying error messages when querying the daemons status on the WMS | Jan 05 | normally because not all of the needed environment is set | source /etc/glite/profile.d/glite_setenv.sh | WMS | script debug | -----
13 | CE configuration failed and services not started - crash at iptables setting | Jan 05 | there is a bug in the CE installation script | use the --no-iptables option while configuring the CE using the post-install script | CE | script debug | -----
14 | WMS postinstall script states at the end that there has been an error | Jan 05 | there is a bug in the WMS installation script | ignore the very final message, but make sure that the daemons are running | WMS | script debug | 6355
15 | submission error: cannot create SandboxDir | Jan 05 | possible wrong configuration | check the magic group in the ftpaccess file in /opt/glite/etc/: you should see a line like "magicgroup gm all" (gm is the GLITE_WMS_GROUP) | WMS | script debug | -----
16 | AuthenticationException: Failed to establish security context... | Jan 05 | something weird on the WMS node: check | | WMS | script debug | -----
17 | PBS error 15062. All jobs stuck. | Jan 05 | no job gets executed | make sure pbs_server.conf does not request a default attribute that you are not publishing in the node list | CE | check log files; reconfigure PBS | -----
18 | core dump when submitting - after VOMS proxy init - error 5027 reading proxy file | Feb 05 | get a core dump | make sure the host cert of the VOMS server is correctly placed on the UI under /etc/grid-security/vomsdir with the proper name (e.g. kuiken.nikhef.nl.pem) | UI | strace reveals the error reading the proxy file; check the UI configuration and put the VOMS host cert there | -----
19 | transport endpoint not connected during job submission | Feb 05 | LB server down | restart the LB daemon on the LB server machine | UI, LB | - | -----
20 | Unable to contact any Network Server (see also item #7) | Feb 05 | something weird with the gridmap file or the pooled accounts (WMS) | check both the grid-map file and the pooled accounts; try, for example, "su - egtest001" on the WMS | UI, WMS | - | -----
21 | Job aborted because of expired proxy (jobs get submitted but are then immediately aborted with a proxy expired error message) | June 05 | misconfiguration of the WMS node | check that the GLITE_USER user and group are properly set, and check directory ownerships | UI, WMS | the system reports an expired proxy even when in reality it could not access it |
22 | Job Aborted: attempts to submit failed | June 05 | PBS misconfiguration: 512 instead of 511 | correct the value of log_level in your PBS server configuration