Patch-ID# 118086-03

Keywords: qmaster scheduler qmon qstat qconf qconf memory usage
Synopsis: N1 Grid Engine 6.0: maintenance patch
Date: Jan/19/2005

Install Requirements: Additional instructions may be listed below

Solaris Release:

SunOS Release:

Unbundled Product: N1 Grid Engine

Unbundled Release: 6.0

Xref: See patch matrix below

Topic:

Relevant Architectures: linux amd64

BugId's fixed with this patch:
5063311 5063316 5063987 5071498 5071502 5071522 5071525 5071539
5071914 5071918 5071987 5072005 5072481 5072772 5073218 5074788
5075346 5075398 5075451 5075849 5075936 5075968 5076358 5076372
5076491 5077165 5077549 5077589 5078783 5079514 5079572 5080779
5080784 5080833 5080836 5080839 5080840 5080851 5080852 5080853
5080856 5081821 5081822 5081839 5082490 5083102 5083115 5084317
5085010 5085392 5086108 5089222 5089255 5090162 5092487 5094016
5095907 5097732 5102320 5102442 5104270 5104789 5108635 5108639
6174301 6174326 6174331 6174821 6174915 6176115 6176177 6176181
6180529 6183365 6184460 6184466 6185136 6185169 6185208 6185211
6189286 6189289 6190164 6191366 6193348 6193361 6193866 6194002
6194625 6194713 6194729 6195249 6196578 6199261 6200013 6201033
6201038 6201039 6201040 6201042 6205648 6211243 6211309 6215580
6216020

Changes incorporated in this version: 6205648 6211243 6211309 6215580 6216020

Patches accumulated and obsoleted by this patch:

Patches which conflict with this patch:

Patches required with this patch:

Obsoleted by:

Files included with this patch:

/bin/sol-x86/qacct
/bin/sol-x86/qalter
/bin/sol-x86/qconf
/bin/sol-x86/qdel
/bin/sol-x86/qhost
/bin/sol-x86/qmake
/bin/sol-x86/qmod
/bin/sol-x86/qmon
/bin/sol-x86/qping
/bin/sol-x86/qsh
/bin/sol-x86/qstat
/bin/sol-x86/qsub
/bin/sol-x86/qtcsh
/bin/sol-x86/sge_coshepherd
/bin/sol-x86/sge_execd
/bin/sol-x86/sge_qmaster
/bin/sol-x86/sge_schedd
/bin/sol-x86/sge_shadowd
/bin/sol-x86/sge_shepherd
/lib/sol-x86/libXltree.so
/lib/sol-x86/libcrypto.so.0.9.7
/lib/sol-x86/libdrmaa.so
/lib/sol-x86/libspoolb.so
/lib/sol-x86/libspoolc.so
/lib/sol-x86/libssl.so.0.9.7
/utilbin/sol-x86/adminrun
/utilbin/sol-x86/berkeley_db_svc
/utilbin/sol-x86/checkprog
/utilbin/sol-x86/checkuser
/utilbin/sol-x86/db_archive
/utilbin/sol-x86/db_checkpoint
/utilbin/sol-x86/db_dump
/utilbin/sol-x86/db_load
/utilbin/sol-x86/filestat
/utilbin/sol-x86/fstype
/utilbin/sol-x86/gethostbyaddr
/utilbin/sol-x86/gethostbyname
/utilbin/sol-x86/gethostname
/utilbin/sol-x86/getservbyname
/utilbin/sol-x86/infotext
/utilbin/sol-x86/loadcheck
/utilbin/sol-x86/now
/utilbin/sol-x86/openssl
/utilbin/sol-x86/qrsh_starter
/utilbin/sol-x86/rlogin
/utilbin/sol-x86/rsh
/utilbin/sol-x86/rshd
/utilbin/sol-x86/sge_share_mon
/utilbin/sol-x86/spooldefaults
/utilbin/sol-x86/spooledit
/utilbin/sol-x86/spoolinit
/utilbin/sol-x86/testsuidroot
/utilbin/sol-x86/uidgid

Problem Description:

6205648 error in commlib read/write timeout handling
6211243 The qstat -ext -xml command is broken with N1GE6 Update 2 patch
6211309 qmaster running out of file descriptors
6215580 execd messages file contains errors for tight integrated jobs
6216020 pending job task deletion may not work

(from 118086-02)

5075968 Thread enabled commlib coredumps on exit on a 32bit Solaris x86 box
5085010 qmon customize filter for running jobs does not filter
5086108 wrong message appears when queue instance becomes error state
5089222 scheduling weirdness with wild-card PE's
5089255 Submit to a queue domain is never scheduled
5090162 qmake does not export shell env. vars
5092487 hard resource requests ignored in parallel jobs
5094016 o-tickets assigned to departments are ignored
5095907 qacct -l is not working
5097732 Need detailed error messages from communication layer
5102320 memory leak in the scheduler, with pe jobs and resource requests
5102442 qconf -de crashes qmaster
5104270 Cannot add calendar with \ syntax
5104789 mail sent by qmaster leaves zombie processes
5108635 $ARCH required in path for qloadsensor and qidle.
5108639 qconf -sstree seg faults with large share trees
6174301 N1GE6: qsub -js and negative job_share numbers acts strangely/unexpectedly.
6174326 qconf -sq displayes "slots" in the complex_values line
6174331 Option "-v VAR" does not fetch from envrionment
6174821 segmentation fault when vmemsize limit is reached
6174915 qconf has wrong exit status
6176115 Show qmaster/execd application status in qping
6176177 restoring a backup does not restore the job_scripts dir.
6176181 qdel "" kills qmaster
6180529 meaningless job error state diagnosis text in qstat -j
6183365 qconf -sstree gives a SIGBUS error
6184460 qmod -[d|e] cannot handle the folowing qnames: "[0-9]*"
6184466 scheduler does not look ahead to consider queue calendars state transitions
6185136 Job customize shows weird characters for fields, additional fields cannot be added
6185169 qmon returns an error dialog, when editing a calendar
6185208 qmon and equal job arguments
6185211 Job environments should not include Grid Engine dynamic library path
6189286 memory leak in the scheduler with consumables as load thresholds
6189289 a cluster queue can be deleted, even though it is referenced in an other cq
6190164 too many array tasks are deleted
6191366 tightly integrated pe jobs: scheduler doesn't respect usage of pe tasks in sharetree calculation
6193348 qconf -mq does not output the subordinate_list correct
6193361 Jobs fail in case of NFS execd installation on volumes exported without root write priviledges
6193866 backup/restore does not work under Linux and others..
6194002 sgemaster -migrate on qmaster host tries to start second qmaster
6194625 subordinate queues consume excessive memory
6194713 Only first subordinate queue will be suspended at qmaster restart
6194729 Subordinate queue thresholds are not spooled with BDB
6195249 QMON Cluster Queue Window: Heading line words does not match into column width
6196578 backup failes, when...
6199261 a sharetree delete can kill qmon
6200013 arch script does not know about /lib64
6201033 qmaster might fail if jobs are deleted which have multiple hold states applied
6201038 reduce the impact of qstat on the overall performance
6201039 qconf -ks gives bad error message if scheduler isn't running
6201040 Exit 99 jobs are not rescheduled to hosts where they ran before
6201042 qdel "*" produces error logging in qmaster messages file

(from 118086-01)

5063311 high memory usage of schedd and qmaster (schedd_job_info)
5063316 PE job submit error, when qmaster is busy
5063987 qmaster cannot bind port below 1024 on Linux
5071498 projects not available after sge_qmaster restart
5071502 calendars broken
5071522 Startup of qmaster changes act_qmaster to `hostname`
5071525 qalter abort
5071539 qping doesn't support host_aliases file
5071914 scheduler ignores queue seqno for queue sorting
5071918 qmod -e '@' causes segmentation fault in qmaster
5071987 Qmaster requires a local conf in order to start.
5072005 drmaa_run_job() may change the current directory
5072481 Deleted pending job appears in qstat
5072772 sge_qmaster constantly rewrites spool files of tightly integrated parallel jobs
5073218 qconf -aq @ crashes qmaster
5074788 jobs on hold due to -a time cause qmaster/schedd get out of sync
5075346 Sharetree doesn't work correct
5075398 variable syntax : equal sign support
5075451 sched_conf(5) reprioritize_interval should default to 0
5075849 a registering event client can get events before it got its total update
5075936 qmon's queue filtering doesn't work
5076358 It shuld be used "."
and "$" with qsub -N
5076372 "|" should be able to be used with qsub -N
5076491 qmaster clients may not reconnect after qmaster outage
5077165 reprioritize_interval descr in sched_conf(5) needs improvemen
5077549 qsub -N "@" causes qmaster down
5077589 schedd and qmaster get out of sync - no scheduling for long time
5078783 Wallclock time limit in qmon
5079514 execd shutdown with sgeexecd fails when host aliases are used
5079572 Resending queue signals broken
5080779 qconf -de host does not update the host groups
5080784 qselect crash
5080833 qconf -mattr dumps core if used incorrectly
5080836 qhosts outputs NCPU as float
5080839 qconf -mq displayes "slots" in the complex_values line
5080840 problems when qconf -mattr is used in conjunction with host_aliases file
5080851 qalter/qdel/qmod abort
5080852 qconf -aq @ crashes qmaster
5080853 DRMAA doesn't reject jobs that never will be dispatchable
5080856 QCONF: qconf -mc segfaults
5081821 qstat XML output typo
5081822 Deleting a queue instance slots value actually adds it
5081839 qconf -ahgrp fails if no hgrp name is specified
5082490 qstat -ext -urg omits time info
5083102 hostgroup changes do not always take effect.
5083115 Need more verbose diagnosis msg if execd port is already bound
5084317 Invalid job_id's in reporting file (only l24_amd64)
5085392 qstat -j -xml generates no parseble xml output

Patch Installation Instructions:
--------------------------------

tar.gz Patch Installation:
--------------------------

See the patch installation instructions below before installing this
patch!

Patches in 'tar.gz' format cannot be installed with 'patchadd' on
Solaris systems.

The patch is installed by unpacking the 'tar.gz' file(s) in this
directory into the directory where the original distribution is
installed, which is usually your $SGE_ROOT directory.

Once installed, this patch is not visible with the "showrev -p" command
on Solaris.

This patch cannot be backed out. You may want to make a backup copy of
the files before installing this patch, since the files will be
overwritten.
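
Because the patch cannot be backed out, the backup suggested above could
be taken with a small shell function like the one below. This is only a
sketch under assumptions: the function name backup_before_patch, its
three arguments and the backup file location are hypothetical and are
not part of the N1 Grid Engine distribution. It archives the bin, lib
and utilbin trees of one binary architecture, which are the trees named
under "Files included with this patch".

```shell
# Hypothetical helper, not shipped with N1 Grid Engine.
# $1: installation root (assumed to be your $SGE_ROOT)
# $2: binary architecture string, for example lx24-amd64
# $3: backup file to create (tar archive compressed with gzip)
backup_before_patch() {
    sge_root=$1
    arch=$2
    backup_file=$3

    # Refuse to continue if the architecture tree does not exist.
    if [ ! -d "$sge_root/bin/$arch" ]; then
        echo "no bin/$arch directory under $sge_root" >&2
        return 1
    fi

    # Archive with paths relative to the root, so that the backup can be
    # unpacked the same way as the patch itself.
    (cd "$sge_root" && tar cf - "bin/$arch" "lib/$arch" "utilbin/$arch") |
        gzip -c > "$backup_file"
}
```

A call might look like
backup_before_patch "$SGE_ROOT" lx24-amd64 /var/tmp/sge60-backup.tar.gz
where the backup path is again only an example.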

Please read "Install Instructions" later in this file and carry out all
steps before you unpack the 'tar.gz' file(s) included in this patch.

This patch in 'tar.gz' format should not be installed if the original
package has been installed with 'pkgadd' on Solaris. If the original
installation used the package ('pkgadd') utility, install the available
patches for N1 Grid Engine 6; refer to the patch matrix below.

The patch is installed by user root by unpacking the file(s) in the
directory where the original package has been installed:

   # cd
   # gzip -dc / | tar xvpf -

After installing the patch, you should correct the file permissions if
your Sun Grid Engine installation is installed as an "admin user"
system:

   # cd
   # util/setfileperm.sh

Patch requirements and patch matrix for N1 Grid Engine 6 packages
-----------------------------------------------------------------

The patches below update a N1 Grid Engine 6 distribution to N1 Grid
Engine 6 Update 2. The "-help" output of most commands will print a
version string "N1GE 6.0u2" after applying the patch.

All packages of a N1 Grid Engine 6 distribution must have the same patch
level. Refer to the patch matrix below, which updates the distribution
to the most recent patch level. Mixing different patch levels in a
single Grid Engine cluster is not supported. Also, make sure to update
the "common" and "arco" packages.

1. Patches for Packages in Sun pkgadd format
--------------------------------------------

Package name*  OS*                    Architecture*  Patch-Id
-----------------------------------------------------------------
SUNWsgee       Solaris, Sparc, 32bit  sol-sparc      118094-03
SUNWsgeex      Solaris, Sparc, 64bit  sol-sparc64    118130-03
SUNWsgeex      Solaris x86            sol-x86        118131-03
SUNWsgeec      all                    common         118132-03
SUNWsgeea      all                    arco           118133-03

*Package Name = see pkginfo(1)
*OS           = Operating system
*Architecture = N1 Grid Engine binary architecture string or
                "common" = architecture independent packages
                "arco"   = Accounting and Reporting console

2.
Patches for packages in tar.gz format
----------------------------------------

OS*                         Architecture  Patch-Id
-----------------------------------------------------
Solaris, Sparc, 32bit       sol-sparc     118082-03
Solaris, Sparc, 64bit       sol-sparc64   118083-03
Solaris, x86                sol-x86       118084-03
Linux kernel2.4/2.6, x86    lx24-x86      118085-03
Linux kernel2.4/2.6, AMD64  lx24-amd64    118086-03
IBM AIX 4.3                 aix43         118087-03
IBM AIX 5.1                 aix51         118088-03
Apple MAC OS/X              darwin        118089-03
HP HP-UX 11                 hp11          118090-03
SGI Irix 6.5                irix65        118091-03
all                         common        118092-03
all                         arco          118093-03

Special Install Instructions:
-----------------------------

NOTE: This patch (6.0u2 or higher) requires that you update your
BerkeleyDB database files. Please read the full notes when applying
this patch.

These installation instructions assume that you are running a
homogeneous N1 Grid Engine cluster (called "the software") where all
hosts share the same directory for the binaries. If you are running the
software in a heterogeneous environment (a mix of different binary
architectures), you need to apply the patch installation for all binary
architectures as well as the "common" and "arco" packages. See the patch
matrix above for details about the available patches.

If you installed the software on local filesystems, you need to install
all relevant patches on all hosts where you installed the software
locally.

By default, there should be no running jobs when the patch is installed.
There may be pending batch jobs, but no pending interactive jobs (qrsh,
qmake, qsh, qtcsh).

It is possible to install the patch with running batch jobs. To avoid a
failure of the active 'sge_shepherd' binary, it is necessary to move the
old shepherd binary (and copy it back prior to the installation of the
patch).

You cannot install the patch with running interactive jobs, 'qmake'
jobs, or running parallel jobs which use the tight integration support
(control_slaves=true is set in the PE configuration).
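
The rules above on running jobs can be summarised as a small decision
function. This is a sketch only: the function name can_install_patch and
the job type labels are hypothetical and chosen for illustration, and
the function does not query the cluster. It only encodes the three
statements made above.

```shell
# Hypothetical helper, not shipped with N1 Grid Engine.
# $1: kind of job still running:
#     none | batch | interactive | qmake | tight-parallel
# $2: "yes" if the old sge_shepherd binary was moved aside with "mv"
# Prints "yes" if the notes above allow the patch installation, else "no".
can_install_patch() {
    case "$1" in
        none)
            # Default case: no running jobs.
            echo yes ;;
        batch)
            # Running batch jobs are tolerated only if the old shepherd
            # binary was moved out of the way first.
            if [ "$2" = yes ]; then echo yes; else echo no; fi ;;
        interactive|qmake|tight-parallel)
            # Never allowed, regardless of the shepherd binary.
            echo no ;;
        *)
            echo "unknown job type: $1" >&2
            return 2 ;;
    esac
}
```

For example, can_install_patch batch yes prints "yes", while
can_install_patch qmake yes prints "no".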

Stopping the N1 Grid Engine cluster from starting jobs
------------------------------------------------------

Disable all queues so that no new jobs are started:

   # qmod -d '*'

Optional (only needed if there are running jobs which should continue to
run when the patch is installed):

   # cd $SGE_ROOT/bin
   # mv /sge_shepherd /sge_shepherd.sge60

It is important that the binary is moved with the "mv" command. It
should not be copied, because this could crash an active shepherd
process which is running a job when the patch is installed.

Shutting down the N1 Grid Engine qmaster, scheduler and execution daemons
-------------------------------------------------------------------------

You need to shut down (and restart) the qmaster and scheduler daemons
and all running execution daemons.

Shut down all your execution hosts. Log in to all your execution hosts
and stop the execution daemons:

   # /etc/init.d/sgeexecd softstop

Then log in to your qmaster machine and stop qmaster and scheduler:

   # /etc/init.d/sgemaster stop

Now verify with the 'ps' command that all N1 Grid Engine daemons on all
hosts are stopped. If you decided to rename the 'sge_shepherd' binary so
that running jobs can continue to run during the patch installation, you
must not kill the 'sge_shepherd' binary (process).

Installing the patch and restarting the software
------------------------------------------------

Now install the patch with "patchadd" or by unpacking the 'tar.gz' files
included in this patch as outlined above.

Berkeley DB database update needed
----------------------------------

After installing this patch, and before restarting your cluster, you
need to update your Berkeley DB (BDB) database in the following cases:

   - you chose the BDB spooling option (not needed for classic
     spooling), either locally or with the BDB RPC option, and you are
     upgrading your cluster from N1 Grid Engine 6.0 or 6.0u1 to N1 Grid
     Engine 6.0u2 or higher

1.
For safety reasons, please make a full backup of your existing
configuration. To perform a backup, use this command:

   % inst_sge -bup

2. Upgrade your BDB database. This is done as follows:

   % inst_sge -updatedb

Restarting the software
-----------------------

Please log in to your qmaster machine and execution hosts and enter:

   # /etc/init.d/sgemaster
   # /etc/init.d/sgeexecd

After restarting the software, you may again enable your queues:

   # qmod -e '*'

If you renamed the shepherd binary, you may safely delete the old binary
when all jobs which were running prior to the patch installation have
finished.

New functionality delivered with this patch
-------------------------------------------

1. Avoid setting of LD_LIBRARY_PATH; inherited job environment
--------------------------------------------------------------

There are two new "execd_params" (defined in the global or local cluster
configuration) which control the environment inherited by a job:

   SET_LIB_PATH
   INHERIT_ENV

By default, SET_LIB_PATH is false and INHERIT_ENV is true.

   - If SET_LIB_PATH is true and INHERIT_ENV is true, each job will
     inherit the environment of the shell that started the execd, with
     the N1GE lib directory prepended to the lib path.

   - If SET_LIB_PATH is true and INHERIT_ENV is false, the environment
     of the shell that started the execd will not be inherited by jobs,
     and the lib path will contain only the N1GE lib directory.

   - If SET_LIB_PATH is false and INHERIT_ENV is true, each job will
     inherit the environment of the shell that started the execd with
     no additional changes to the lib path.

   - If SET_LIB_PATH is false and INHERIT_ENV is false, the environment
     of the shell that started the execd will not be inherited by jobs,
     and the lib path will be empty.

Environment variables which are normally overwritten by the shepherd,
such as PATH or LOGNAME, are unaffected by these new parameters.

2.
DRMAA Java[TM] language binding delivered with this patch
------------------------------------------------------------

The DRMAA Java language binding is now available. The DRMAA Java
language binding library and documentation are contained in the patch
for the "common" package.

3. New qstat options to optimize memory overhead and speed of qstat
-------------------------------------------------------------------

The qstat client command has been enhanced to reduce the overall amount
of memory which is requested from the qmaster. To enable these changes,
it is necessary to change the qstat default behaviour. This is possible
by defining a cluster-global or user-specific sge_qstat file. More
information can be found in the sge_qstat(5) manual page.

In addition, two new qstat options ("-u" and "-s") have been introduced
to be used with the sge_qstat default file. Find more information in
qstat(1).

4. Tuning parameter for sharetree spooling
------------------------------------------

A new "qmaster_param" (configured in the global cluster configuration):

   STREE_SPOOL_INTERVAL=