| # | Problem (short) | Date | Details (cause) | Solution | Involved components | Debug procedure | Savannah bug # |
| 1 | Condor: "Cannot create new cluster" failure | Feb 04 | WMS constraint | make sure QUEUE_SUPER_USERS in the Condor config on the CE is set to the Condor user name on the WMS node | WMS, UI | checked the Globus log files and reinstalled the machine | ----- |
| 2 | Cannot locate Condor schedd (Error locating schedd) | Nov 04 | gatekeeper on the CE not running, condor-c symlink missing, or an authentication/authorization failure | log in on the CE and start the gatekeeper again, or make sure authentication/authorization works; also make sure that the HOME directory for the Condor user (CONDOR_IDS) exists on the CE | WMS, CE | check the most relevant log files; check that basic Globus functionality is there | 5525 |
| 3 | Jobs stay in Ready state | Dec 04, Jan 05 | LRMS not working or not properly configured | make sure the LRMS is working correctly | WMS, CE | run a basic LRMS test such as qsub or bsub and retrieve the output | ----- |
| 4 | no local mappings for globus id | Oct 04 | no authorization on the resource | put /etc/grid-security/grid-mapfile in place with your DN in it | CE | edit files | ----- |
| 5 | Condor daemons not starting and no log files available | Jan 05 | most likely wrong permissions or ownership of the relevant Condor directories | chown -R gproduct:gm /opt/condor-6.7.3 | WMS | see the Condor standard log files | ----- |
| 6 | Condor schedd doesn't start | Nov 04 | the condor_config is not correct | edit /opt/condor-6.7.3/etc/condor_config and the local configuration file, and make sure CONDOR_IDS is correct | WMS | check the Condor configuration files | ----- |
| 7 | Unable to contact any Network Server (see also item #20) | Nov 04 | most likely the NS daemon is not running on the WMS node | log in on the WMS and restart the WMS daemon, or make sure you are in the gridmap file of the WMS | UI, WMS | check daemon status on the NS | 5325 |
| 8 | Job got an error while in the CondorG queue. | Nov 04 | similar to problem 2: normally the schedd could not be found | make sure the gatekeeper is running on the CE and check the Condor log files under /opt/condor-6.7.3/lxb1420.local/log | UI, CE | use the command glite-job-logging-info -v 2 JobId | 5525 |
| 9 | glite-job-list-match doesn't return any resource for the job to be executed | Jan 05 | normally this is because something is wrong in the information system (the fake BD-II) | log in on the fake BD-II and make sure things are properly set | BD-II, UI | use the update command for the LCG BD-II and ldapsearch to check the BD-II content | 5115 |
| 10 | no reaction during the submission commands | Nov 04 | sometimes this is due to a wrong address for the LB machine | check GLITE_LB_DESTINATION and GLITE_WMS_LOG_DESTINATION | UI, LB | edit the /opt/glite/etc/egtest/glite_wmsui.conf file, and check or export the proper variables | ----- |
| 11 | failed logging to the LB daemon | Dec 04 | sometimes the bkserver on the LB is down | restart the LB server on the LB node with /opt/glite/etc/init.d/glite-bkserver start | UI, LB | check daemon status on the LB | ----- |
| 12 | odd, varying error messages when querying the daemon status on the WMS | Jan 05 | normally because not all of the required environment is set | source /etc/glite/profile.d/glite_setenv.sh | WMS | script debug | ----- |
| 13 | CE configuration failed and services not started: crash at iptables setting | Jan 05 | there is a bug in the CE installation script | use the --no-iptables option while configuring the CE using the post-install script | CE | script debug | ----- |
| 14 | WMS post-install script states at the end that there has been an error | Jan 05 | there is a bug in the WMS installation script | ignore the very final message, but make sure that the daemons are running | WMS | script debug | 6355 |
| 15 | submission error: cannot create SandboxDir | Jan 05 | possibly a wrong configuration | check the magic group in the ftpaccess file under /opt/glite/etc/: there should be a line like "magicgroup gm all" (gm is the GLITE_WMS_GROUP) | WMS | script debug | ----- |
| 16 | AuthenticationException: Failed to establish security context... | Jan 05 | something weird on the WMS node: | check | WMS | script debug | ----- |
| 17 | PBS error 15062: all jobs stuck | Jan 05 | no job gets executed | make sure pbs_server.conf does not request a default attribute that you are not publishing in the node list | CE | check the log files; reconfigure PBS | ----- |
| 18 | core dump when submitting, after VOMS proxy init: error 5027 reading proxy file | Feb 05 | a core dump is produced | make sure the host cert of the VOMS server is correctly placed on the UI under /etc/grid-security/vomsdir with the proper name (e.g. kuiken.nikhef.nl.pem) | UI | strace reveals the error reading the proxy file; check the UI configuration and put the VOMS host cert there | ----- |
| 19 | transport endpoint not connected during job submission | Feb 05 | LB server down | restart the LB daemon on the LB server machine | UI, LB | - | ----- |
| 20 | Unable to contact any Network Server (see also item #7) | Feb 05 | something weird with the gridmap file or the pooled accounts (WMS) | check both the grid-map file and the pooled accounts; try, for example, "su - egtest001" on the WMS | UI, WMS | - | ----- |
| 21 | Job aborted because of expired proxy (jobs get submitted but are then immediately aborted with a "proxy expired" error message) | June 05 | misconfiguration of the WMS node | check that the GLITE_USER user and group are properly set, and check the directory ownerships | UI, WMS | the system reports an expired proxy even when in reality it could not access the proxy |  |
| 22 | Job Aborted: attempts to submit failed | June 05 | PBS misconfiguration: 512 instead of 511 | correct the value of log_level inside your PBS server configuration |  |  |  |
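
Several of the checks above (items 4, 10, 18 and 20) come down to confirming that a variable is exported or that a file contains the expected entry. The following is a minimal sketch of such checks, assuming a POSIX shell. The helper names `check_env`, `dn_in_gridmap` and `vomsdir_has_cert` are hypothetical and not part of gLite. The default paths are the ones quoted in the table, and the grid-mapfile is assumed to hold one quoted DN per line followed by the local account.

```shell
#!/bin/sh
# Hypothetical helpers (not part of gLite) for the file and environment
# checks listed in the table above.

# check_env VAR...: print OK/MISSING for each named environment variable.
# Returns non-zero if at least one of them is unset or empty.
check_env() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "MISSING: $name"
            missing=1
        else
            echo "OK: $name=$value"
        fi
    done
    return $missing
}

# dn_in_gridmap DN [FILE]: succeed if the quoted DN appears in the
# grid-mapfile (assumed format: "DN" localaccount).
dn_in_gridmap() {
    grep -F -q -- "\"$1\"" "${2:-/etc/grid-security/grid-mapfile}"
}

# vomsdir_has_cert HOST [DIR]: succeed if HOST.pem is readable in the
# vomsdir directory (item 18).
vomsdir_has_cert() {
    [ -r "${2:-/etc/grid-security/vomsdir}/$1.pem" ]
}
```

After sourcing these definitions on the UI, `check_env GLITE_LB_DESTINATION GLITE_WMS_LOG_DESTINATION` covers item 10 and `vomsdir_has_cert kuiken.nikhef.nl` covers item 18; `dn_in_gridmap` takes the DN of your own certificate as its first argument.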