OBSOLETE Patch-ID# 120434-01

NOTE:
***********************************************************************
READ THE TERMS OF THE AGREEMENT ("AGREEMENT") IN THE LEGAL_LICENSE.TXT
FILE CAREFULLY BEFORE USING THIS SOFTWARE. BY USING THE SOFTWARE, YOU
AGREE TO THE TERMS OF THIS AGREEMENT. IF YOU DO NOT AGREE TO ALL OF THE
TERMS, PROMPTLY DESTROY THE UNUSED SOFTWARE.
***********************************************************************

Keywords: qmaster scheduler qmon qstat qconf qconf memory usage
Synopsis: Obsoleted by: 120434-02 N1 Grid Engine 6.0: maintenance patch
Date: Jul/18/2005

Install Requirements: Additional instructions may be listed below

Solaris Release:

SunOS Release:

Unbundled Product: N1 Grid Engine

Unbundled Release: 6.0

Xref: See patch matrix below

Topic:

Relevant Architectures:

BugId's fixed with this patch: 4769608 6218877 6239470 6250603 6252469
6252525 6260656 6264592 6265154 6266392 6266450 6267238 6267245 6267932
6268707 6269305 6269411 6273006 6273217 6274467 6277909 6278147 6278727
6279402 6279409 6280698 6281440 6281462 6283308 6285898 6286510 6286533
6287867 6287953 6287958 6288156 6288588 6294397 6295165

Changes incorporated in this version: 4769608 6218877 6239470 6250603
6252469 6252525 6260656 6264592 6265154 6266392 6266450 6267238 6267245
6267932 6268707 6269305 6269411 6273006 6273217 6274467 6277909 6278147
6278727 6279402 6279409 6280698 6281440 6281462 6283308 6285898 6286510
6286533 6287867 6287953 6287958 6288156 6288588 6294397 6295165

Patches accumulated and obsoleted by this patch:

Patches which conflict with this patch:

Patches required with this patch:

Obsoleted by:

Files included with this patch:

/bin/win32-x86/qacct
/bin/win32-x86/qalter
/bin/win32-x86/qconf
/bin/win32-x86/qdel
/bin/win32-x86/qhost
/bin/win32-x86/qloadsensor.exe
/bin/win32-x86/qmake
/bin/win32-x86/qmod
/bin/win32-x86/qping
/bin/win32-x86/qsh
/bin/win32-x86/qstat
/bin/win32-x86/qsub
/bin/win32-x86/qtcsh
/bin/win32-x86/sge_coshepherd
/bin/win32-x86/sge_execd
/bin/win32-x86/sge_shepherd
/bin/win32-x86/sgepasswd
/lib/win32-x86/libcrypto.so.0.9.7
/lib/win32-x86/libspoolb.so
/lib/win32-x86/libspoolc.so
/lib/win32-x86/libssl.so.0.9.7
/utilbin/win32-x86/adminrun
/utilbin/win32-x86/checkprog
/utilbin/win32-x86/checkuser
/utilbin/win32-x86/filestat
/utilbin/win32-x86/fstype
/utilbin/win32-x86/gethostbyaddr
/utilbin/win32-x86/gethostbyname
/utilbin/win32-x86/gethostname
/utilbin/win32-x86/getservbyname
/utilbin/win32-x86/infotext
/utilbin/win32-x86/loadcheck
/utilbin/win32-x86/now
/utilbin/win32-x86/openssl
/utilbin/win32-x86/qrsh_starter
/utilbin/win32-x86/rlogin
/utilbin/win32-x86/rsh
/utilbin/win32-x86/rshd
/utilbin/win32-x86/sge_share_mon
/utilbin/win32-x86/spooldefaults
/utilbin/win32-x86/spooledit
/utilbin/win32-x86/spoolinit
/utilbin/win32-x86/testsuidroot
/utilbin/win32-x86/uidgid

Problem Description:

6295165 finished array job tasks can be rescheduled if master/scheduler
        daemons are stopped/started
6294397 wrong drmaa jnilib link on MacOS
6288588 jobs submitted with -v PATH do not retain $TMPDIR prefixed by
        N1GE as required for tight integration
6288156 sge_shepherd SEGV's when it tries to fopen the usage file
6287958 suspend not working under Mac OS X
6287953 getting many E messages "failed building category string for
        job N"
6287867 tight integration: temporary files are not deleted at task exit
6286533 job wallclock monitoring and enforcement considers prolog/epilog
        runtime part of net job runtime
6286510 delivery of queue based signals to execd repeated endlessly
6285898 qconf -Xattr does not resolve fqdn hostnames
6283308 overhead with job execution could lead to overoptimistic
        backfilling and break resource reservation
6281462 qmaster profiling can only be turned on by restarting qmaster
6281440 resource allocation shown by qstat/qhost not consistent with
        resource utilization
6280698 Resource filtering with qhost broken
6279409 qconf -tsm command generates too much data (very large
        schedd_runlog file)
6279402 drmaa_exit() causes qmaster error logging if host is no admin
        host
6278727 qstat -xml -urg output contains badly formatted numbers
6278147 drmaa_job_ps() returns DRMAA_PS_QUEUED_ACTIVE for finished array
        job rather than DRMAA_PS_DONE
6277909 qconf -mq coredumps
6274467 qmon kills a system
6273217 race condition with qsub -sync and drmaa_wait() if job exits
        directly after being submitted
6273006 qstat -j "" results in a segmentation fault
6269411 Close integration causes jobscripts with multiple mprun commands
        to be killed.
6269305 qrsh/qsh/qlogin reject -js option
6268707 job_load_adjustments is not correctly working when parallel
        jobs are submitted.
6267932 high CPU load of qmaster even on empty cluster
6267245 Repeated logging of the same message produces giant logging
        files
6267238 Multithreaded DRMAA may crash due to use of sge_strtok()
6266450 performance bottleneck with subordinate list
6266392 Performance problem with qconf -mattr exechost XX XX global
6265154 Wildcards in PE Name Cause Unusual Behavior
6264592 drmaa_control(DRMAA_JOB_IDS_SESSION_ALL,
        DRMAA_CONTROL_SUSPEND|RESUME) returns INVALID_JOB error
6260656 incomplete resource reservation with array jobs
6252525 qmon: complex attributes not removable
6252469 misleading qstat -j messages in case of resource reservation
6250603 qmon crash (segmentation fault) on Solaris64
6218877 qstat -t is broken
4769608 qalter shows wrong priority number when using negative
        priorities with -p option

The following Change Request (CR) is related only to Grid Engine running
under the Microsoft Windows operating system family

6239470 Avoid that sge_execd has to be started by the Domain
        Administrator

Patch Installation Instructions:
--------------------------------

tar.gz Patch Installation:
--------------------------

See the patch installation instructions below before installing this
patch!

Patches in 'tar.gz' format cannot be installed with 'patchadd' on
Solaris systems.
The patch is installed by unpacking the 'tar.gz' file(s) in this
directory in . is usually your directory.

After installation, this patch is not visible with the "showrev -p"
command on Solaris. This patch cannot be backed out. You may want to
make a backup copy of the files before installing this patch since the
files will be overwritten.

Please read "Install Instructions" later in this file and carry out all
steps before you unpack the 'tar.gz' file(s) included in this patch.

This patch in 'tar.gz' format should not be installed if the original
package has been installed with 'pkgadd' on Solaris. If the original
installation used the package ('pkgadd') utility, install the available
patches for N1 Grid Engine 6; refer to the patch matrix below.

The patch is installed by user root by unpacking the file(s) in the
directory where the original package has been installed:

   # cd
   # gzip -dc / | tar xvpf -

After installing the patch, you should correct the file permissions if
your Sun Grid Engine installation is installed as an "admin user"
system:

   # cd
   # util/setfileperm.sh

Patch requirements and patch matrix for N1 Grid Engine 6 packages
-----------------------------------------------------------------

The patches below update an N1 Grid Engine 6 distribution to N1 Grid
Engine 6 Update 5 (N1GE 6.0u5). The "-help" output of most commands will
print a version string "N1GE 6.0u5" after applying the patch.

All packages of an N1 Grid Engine 6 distribution must have the same
patch level (exception for ARCo - see requirements for ARCo in the ARCo
patches README's). Please refer to the patch matrix below, which updates
the distribution to the most recent patch level. It is not supported to
mix different patch levels of binaries and the "common" package in a
single N1 Grid Engine cluster.

1. Patches for packages in Sun pkgadd format
--------------------------------------------

   Package name*  OS*                     Architecture*  Patch-Id
   -----------------------------------------------------------------
   SUNWsgee       Solaris, Sparc, 32bit   sol-sparc      118094-05
   SUNWsgeex      Solaris, Sparc, 64bit   sol-sparc64    118130-05
   SUNWsgeex      Solaris x86             sol-x86        118131-05
   SUNWsgeeax     Solaris, x64 (AMD64)    sol-amd64      120438-01
   SUNWsgeec      all                     common         118132-05
   SUNWsgeea      all                     arco           118133-04
   SUNWsgeed      all                     doc            119846-02

   *Package Name = see pkginfo(1)
   *OS           = Operating system
   *Architecture = N1 Grid Engine binary architecture string or
                   "common" = architecture independent packages
                   "arco"   = Accounting and Reporting console
                   "doc"    = PDF documentation
                   "gemm"   = Grid Engine Management Module for Sun
                              Control Station (SCS) (tar.gz only)

2. Patches for packages in tar.gz format
----------------------------------------

   OS*                          Architecture   Patch-Id
   -----------------------------------------------------
   Solaris, Sparc, 32bit        sol-sparc      118082-05
   Solaris, Sparc, 64bit        sol-sparc64    118083-05
   Solaris, x86                 sol-x86        118084-05
   Solaris, x64 (AMD64)         sol-amd64      120439-01
   Linux kernel2.4/2.6, x86     lx24-x86       118085-05
   Linux kernel2.4/2.6, AMD64   lx24-amd64     118086-05
   IBM AIX 4.3                  aix43          118087-05
   IBM AIX 5.1                  aix51          118088-05
   Apple MAC OS/X               darwin         118089-05
   HP HP-UX 11                  hp11           118090-05
   SGI Irix 6.5                 irix65         118091-05
   Microsoft Windows            win32-x86      120434-01
   all                          common         118092-05
   all                          arco           118093-04
   all                          doc            119861-02
   Solaris, Linux               gemm           120435-01

Special Install Instructions:
-----------------------------

NOTE: This patch requires that you update your Berkeley DB database
      files if you are upgrading from N1GE 6.0u1 or 6.0. Please read
      the full notes when applying this patch.

These installation instructions assume that you are running a
homogeneous N1 Grid Engine cluster (called "the software") where all
hosts share the same directory for the binaries.
If you are running the software in a heterogeneous environment (mix of
different binary architectures), you need to apply the patch
installation for all binary architectures as well as the "common" and
"arco" packages. See the patch matrix above for details about the
available patches.

If you installed the software on local filesystems, you need to install
all relevant patches on all hosts where you installed the software
locally.

By default, there should be no running jobs when the patch is installed.
There may be pending batch jobs, but no pending interactive jobs (qrsh,
qmake, qsh, qtcsh).

It is possible to install the patch with running batch jobs. To avoid a
failure of the active 'sge_shepherd' binary, it is necessary to move the
old shepherd binary (and copy it back prior to the installation of the
patch).

You cannot install the patch with running interactive jobs, 'qmake' jobs
or with running parallel jobs which use the tight integration support
(control_slaves=true in PE configuration is set).

Stopping the N1 Grid Engine cluster from starting jobs
------------------------------------------------------

Disable all queues so that no new jobs are started:

   # qmod -d '*'

Optional (only needed if there are running jobs which should continue to
run when the patch is installed):

   # cd $SGE_ROOT/bin
   # mv /sge_shepherd /sge_shepherd.sge60

It is important that the binary is moved with the "mv" command. It
should not be copied, because copying could crash an active shepherd
process that is running a job when the patch is installed.

Shutting down the N1 Grid Engine qmaster, scheduler and execution daemons
-------------------------------------------------------------------------

You need to shut down (and restart) the qmaster and scheduler daemon and
all running execution daemons.

Shut down all your execution hosts.
Log in to all your execution hosts and stop the execution daemons:

   # /etc/init.d/sgeexecd softstop

Then log in to your qmaster machine and stop qmaster and scheduler:

   # /etc/init.d/sgemaster stop

Now verify with the 'ps' command that all N1 Grid Engine daemons on all
hosts are stopped. If you decided to rename the 'sge_shepherd' binary so
that running jobs can continue to run during the patch installation, you
must not kill the 'sge_shepherd' binary (process).

Installing the patch and restarting the software
------------------------------------------------

Now install the patch with "patchadd" or by unpacking the 'tar.gz' files
included in this patch as outlined above.

Berkeley DB database update needed
----------------------------------

NOTE: This update is not needed if you already installed N1GE 6.0u3 or
      higher. The update is only needed if you are upgrading from
      N1GE 6.0u1 or earlier.

After installing this patch, and before restarting your cluster, you
need to update your Berkeley DB (BDB) database in the following cases:

   - you chose the BDB spooling option (not needed for classic spooling)
     either locally or with the BDB RPC option, and you are upgrading
     your cluster from N1 Grid Engine 6.0 or 6.0u1 to N1 Grid Engine
     6.0u2 or higher

1. For safety reasons, please make a full backup of your existing
   configuration. To perform a backup use this command

   % inst_sge -bup

2. Upgrade your BDB database. This is done as follows:

   % inst_sge -updatedb

Restarting the software
-----------------------

Please log in to your qmaster machine and execution hosts and enter:

   # /etc/init.d/sgemaster
   # /etc/init.d/sgeexecd

After restarting the software, you may again enable your queues:

   # qmod -e '*'

If you renamed the shepherd binary, you may safely delete the old binary
when all jobs which were running prior to the patch installation have
finished.

New functionality delivered since N1GE 6.0
-------------------------------------------

1. Avoid setting of LD_LIBRARY_PATH; inherited job environment
--------------------------------------------------------------

There are two new "execd_params" (defined in the global or local cluster
configuration) which control the environment inherited by a job:

   SET_LIB_PATH
   INHERIT_ENV

By default, SET_LIB_PATH is false and INHERIT_ENV is true.

If SET_LIB_PATH is true and INHERIT_ENV is true, each job will inherit
the environment of the shell that started the execd, with the N1GE lib
directory prepended to the lib path.

If SET_LIB_PATH is true and INHERIT_ENV is false, the environment of the
shell that started the execd will not be inherited by jobs, and the lib
path will contain only the N1GE lib directory.

If SET_LIB_PATH is false and INHERIT_ENV is true, each job will inherit
the environment of the shell that started the execd with no additional
changes to the lib path.

If SET_LIB_PATH is false and INHERIT_ENV is false, the environment of
the shell that started the execd will not be inherited by jobs, and the
lib path will be empty.

Environment variables which are normally overwritten by the shepherd,
such as PATH or LOGNAME, are unaffected by these new parameters.

2. DRMAA Java[TM] language binding delivered with this patch
------------------------------------------------------------

The DRMAA Java language binding is now available. The DRMAA Java
language binding library and documentation are contained in the patch
for the "common" package.

3. New qstat options to optimize memory overhead and speed of qstat
-------------------------------------------------------------------

The qstat client command has been enhanced to reduce the overall amount
of memory which is requested from the qmaster. To enable these changes
it is necessary to change the qstat default behavior. This is possible
by defining a cluster-global or user-specific sge_qstat file. More
information can be found in the sge_qstat(5) manual page.
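
As an illustration only, a user-specific qstat defaults file might be
created as sketched below. The file name ".sge_qstat", its location in
the user's home directory, and the one-option-per-line layout are
assumptions that are not stated in this README and should be checked
against sge_qstat(5). The option values are examples, and the sketch
writes to a scratch directory instead of the real home directory.

```shell
#!/bin/sh
# Sketch: write a user-specific qstat defaults file into a scratch directory.
# ASSUMPTION: the name ".sge_qstat" and the one-option-per-line layout are
# not given in this README; verify both in sge_qstat(5) before use.

demo_home=$(mktemp -d)                 # stand-in for $HOME, nothing real is touched
defaults_file="$demo_home/.sge_qstat"

# Example defaults: show only the invoking user's jobs (-u) that are in
# the running state (-s r). Both values are illustrative.
cat > "$defaults_file" <<EOF
-u $(id -un)
-s r
EOF

# Show what was written.
cat "$defaults_file"
```

To try this on a real installation, the file would be copied to the
location named in sge_qstat(5); options given on the qstat command line
are expected to take precedence over the defaults file, which should
also be confirmed there.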
In addition, two new qstat options ("-u" and "-s") have been introduced
to be used with the sge_qstat default file. Find more information in
qstat(1).

4. Tuning parameter for sharetree spooling
------------------------------------------

A new "qmaster_param" (configured in the global cluster configuration):

   STREE_SPOOL_INTERVAL=