Patch-ID# 123431-02

NOTE:
***********************************************************************
READ THE TERMS OF THE AGREEMENT ("AGREEMENT") IN THE LEGAL_LICENSE.TXT
FILE CAREFULLY BEFORE USING THIS SOFTWARE. BY USING THE SOFTWARE, YOU
AGREE TO THE TERMS OF THIS AGREEMENT. IF YOU DO NOT AGREE TO ALL OF THE
TERMS, PROMPTLY DESTROY THE UNUSED SOFTWARE.
***********************************************************************

Keywords: qmaster scheduler qmon qstat qconf qconf memory usage root
Synopsis: N1 Grid Engine 6.0: maintenance patch
Date: Oct/19/2006

Install Requirements: See Special Install Instructions

Solaris Release: 7 8 9 10

SunOS Release: 5.7 5.8 5.9 5.10

Unbundled Product: N1 Grid Engine

Unbundled Release: 6.0

Xref: See patch matrix below

Topic:

Relevant Architectures: sparc

BugId's fixed with this patch: 6424565 6458510 6458517

Changes incorporated in this version: 6458510 6458517

Patches accumulated and obsoleted by this patch:

Patches which conflict with this patch:

Patches required with this patch: 123037-01 (or greater)

Obsoleted by:

Files included with this patch:

/bin/sol-sparc/qacct
/bin/sol-sparc/qalter
/bin/sol-sparc/qconf
/bin/sol-sparc/qdel
/bin/sol-sparc/qhost
/bin/sol-sparc/qmake
/bin/sol-sparc/qmod
/bin/sol-sparc/qmon
/bin/sol-sparc/qping
/bin/sol-sparc/qsh
/bin/sol-sparc/qstat
/bin/sol-sparc/qsub
/bin/sol-sparc/qtcsh
/bin/sol-sparc/sge_coshepherd
/bin/sol-sparc/sge_execd
/bin/sol-sparc/sge_qmaster
/bin/sol-sparc/sge_schedd
/bin/sol-sparc/sge_shadowd
/bin/sol-sparc/sge_shepherd
/bin/sol-sparc/sgepasswd
/examples/jobsbin/sol-sparc/work
/lib/sol-sparc/libXltree.so
/lib/sol-sparc/libcrypto.so.0.9.7
/lib/sol-sparc/libdrmaa.so
/lib/sol-sparc/libspoolb.so
/lib/sol-sparc/libspoolc.so
/lib/sol-sparc/libssl.so.0.9.7
/utilbin/sol-sparc/adminrun
/utilbin/sol-sparc/berkeley_db_svc
/utilbin/sol-sparc/checkprog
/utilbin/sol-sparc/checkuser
/utilbin/sol-sparc/db_archive
/utilbin/sol-sparc/db_checkpoint
/utilbin/sol-sparc/db_deadlock
/utilbin/sol-sparc/db_dump
/utilbin/sol-sparc/db_load
/utilbin/sol-sparc/db_printlog
/utilbin/sol-sparc/db_recover
/utilbin/sol-sparc/db_stat
/utilbin/sol-sparc/db_upgrade
/utilbin/sol-sparc/db_verify
/utilbin/sol-sparc/filestat
/utilbin/sol-sparc/fstype
/utilbin/sol-sparc/gethostbyaddr
/utilbin/sol-sparc/gethostbyname
/utilbin/sol-sparc/gethostname
/utilbin/sol-sparc/getservbyname
/utilbin/sol-sparc/infotext
/utilbin/sol-sparc/loadcheck
/utilbin/sol-sparc/now
/utilbin/sol-sparc/openssl
/utilbin/sol-sparc/qrsh_starter
/utilbin/sol-sparc/rlogin
/utilbin/sol-sparc/rsh
/utilbin/sol-sparc/rshd
/utilbin/sol-sparc/sge_share_mon
/utilbin/sol-sparc/spooldefaults
/utilbin/sol-sparc/spooledit
/utilbin/sol-sparc/spoolinit
/utilbin/sol-sparc/testsuidroot
/utilbin/sol-sparc/uidgid

Problem Description:

6458517 unreasonably long scheduler dispatch times if lots of projects
        are used in share tree
6458510 unreasonably long scheduler dispatch times if lots of cluster
        queues are deployed in large clusters

(from 123431-01)

6424565 jobs with negative priority will be rejected by qmaster

Patch Installation Instructions:
--------------------------------

For Solaris 7, 8, 9 and 10 releases, refer to the man pages for
instructions on using 'patchadd' and 'patchrm' scripts provided with
Solaris. Any other special or non-generic installation instructions
should be described below as special instructions.

The following example installs a patch to a standalone machine:

   example# patchadd /var/spool/patch/104945-02

The following example removes a patch from a standalone system:

   example# patchrm 104945-02

For additional examples please see the appropriate man pages. See the
"Special Install Instructions" section below before installing this
patch.

Patch requirements and patch matrix for N1 Grid Engine 6 packages
-----------------------------------------------------------------

The patches below update a N1 Grid Engine 6 distribution to N1 Grid
Engine 6 Update 8 (N1GE 6.0u8_2).
The "-help" output of most commands will print a version string
"N1GE 6.0u8_2" after applying the patch.

All packages of an N1 Grid Engine 6 distribution must have the same
patch level (exception for ARCo - see requirements for ARCo in the ARCo
patches README's). Please refer to the patch matrix below, which updates
the distribution to the most recent patch level.

It is neither supported nor possible to mix different patch levels of
binaries and the "common" package in a single N1 Grid Engine cluster.

1. Patches for packages in Sun pkgadd format
--------------------------------------------

Package name*  OS*                    Architecture*  Patch-Id
-----------------------------------------------------------------
SUNWsgee       Solaris, Sparc, 32bit  sol-sparc      123431-02
SUNWsgeex      Solaris, Sparc, 64bit  sol-sparc64    123432-02
SUNWsgeei      Solaris x86            sol-x86        123433-02
SUNWsgeeax     Solaris, x64 (AMD64)   sol-amd64      123434-02
SUNWsgeec      all                    common         118132-09
SUNWsgeea      all                    arco           118133-06
SUNWsgeed      all                    doc            119846-02

The N1GE 6.0u8_2 release is based on N1GE 6.0u8 and requires the
installation of the respective patches listed in the table below.

Package name*  OS*                    Architecture*  Patch-Id
-----------------------------------------------------------------
SUNWsgee       Solaris, Sparc, 32bit  sol-sparc      123037-01
SUNWsgeex      Solaris, Sparc, 64bit  sol-sparc64    123038-01
SUNWsgeei      Solaris x86            sol-x86        123039-01
SUNWsgeeax     Solaris, x64 (AMD64)   sol-amd64      123040-01

*Package Name = see pkginfo(1)
*OS           = Operating system
*Architecture = N1 Grid Engine binary architecture string or
                "common" = architecture independent packages
                "arco"   = Accounting and Reporting console
                "doc"    = PDF documentation
                "gemm"   = Grid Engine Management Module for Sun
                           Control Station (SCS) (tar.gz only)

2. Patches for packages in tar.gz format
----------------------------------------

OS*                         Architecture  Patch-Id
-----------------------------------------------------
Solaris, Sparc, 32bit       sol-sparc     123435-02
Solaris, Sparc, 64bit       sol-sparc64   123436-02
Solaris, x86                sol-x86       123437-02
Solaris, x64 (AMD64)        sol-amd64     123438-02
Linux kernel2.4/2.6, x86    lx24-x86      123439-02
Linux kernel2.4/2.6, AMD64  lx24-amd64    123440-02
IBM AIX 4.3                 aix43         123047-02
IBM AIX 5.1                 aix51         123048-02
Apple MAC OS/X              darwin        123049-02
HP HP-UX 11                 hp11          123050-02
SGI Irix 6.5                irix65        123051-02
Microsoft Windows           win32-x86     123052-03
all                         common        118092-09
all                         arco          118093-06
all                         doc           119861-02
Solaris, Linux              gemm          120435-02

Special Install Instructions:
-----------------------------

Content
-------

Patch Installation
   Stopping the N1 Grid Engine cluster to start jobs
   Shutting down the N1 Grid Engine daemons
   Installing the patch and restarting the software
New functionality delivered with N1GE 6.0 Update 8_1
   New daemon for starting Windows GUI Applications
New functionality delivered with N1GE 6.0 Update 7
   Reworked "qstat -xml" output
   Reworked PE range matching algorithm in the scheduler
   New monitoring feature in qmaster
   New parameter for specialized job deletion
   New reporting parameter to control accounting file flush time
New functionality delivered with N1GE 6.0 Update 6
   Berkeley DB database tools are included in the distribution
New functionality delivered with N1GE 6.0 Update 4
   New "qconf -purge" option
   Berkeley DB spooling on NFSv4 under Solaris 10 supported
   Execd installation in Solaris 10 zones supported
   Faster execution daemon reconnect in CSP mode
New functionality delivered with N1GE 6.0 Update 2
   Avoid setting of LD_LIBRARY_PATH
   DRMAA Java[TM] language binding delivered with this patch
   New qstat options to optimize memory overhead and speed of qstat
   Tuning parameter for sharetree spooling

Patch Installation
------------------

NOTE: This patch requires that you update your Berkeley DB database files if you are
upgrading from N1GE 6.0u1 or 6.0. Please read the full notes when
applying this patch.

These installation instructions assume that you are running a
homogeneous N1 Grid Engine cluster (called "the software") where all
hosts share the same directory for the binaries. If you are running the
software in a heterogeneous environment (mix of different binary
architectures), you need to apply the patch installation for all binary
architectures as well as the "common" and "arco" packages. See the patch
matrix above for details about the available patches.

If you installed the software on local filesystems, you need to install
all relevant patches on all hosts where you installed the software
locally.

By default, there should be no running jobs when the patch is installed.
There may be pending batch jobs, but no pending interactive jobs (qrsh,
qmake, qsh, qtcsh).

It is possible to install the patch with running batch jobs. To avoid a
failure of the active 'sge_shepherd' binary, it is necessary to move the
old shepherd binary (and copy it back prior to the installation of the
patch).

You cannot install the patch with running interactive jobs, 'qmake' jobs
or with running parallel jobs which use the tight integration support
(control_slaves=true is set in the PE configuration).

A. Stopping the N1 Grid Engine cluster to start jobs
----------------------------------------------------

Disable all queues so that no new jobs are started:

   # qmod -d '*'

Optional (only needed if there are running jobs which should continue to
run when the patch is installed):

   # cd $SGE_ROOT/bin
   # mv /sge_shepherd /sge_shepherd.sge60

It is important that the binary is moved with the "mv" command. It
should not be copied because this could cause the crash of an active
shepherd process which is currently running a job when the patch is
installed.

B. Shutting down the N1 Grid Engine daemons
-------------------------------------------

You need to shut down (and restart) the qmaster and scheduler daemon and
all running execution daemons.

Shut down all your execution hosts. Log in to all your execution hosts
and stop the execution daemons:

   # /etc/init.d/sgeexecd softstop

Then log in to your qmaster machine and stop qmaster and scheduler:

   # /etc/init.d/sgemaster stop

Now verify with the 'ps' command that all N1 Grid Engine daemons on all
hosts are stopped. If you decided to rename the 'sge_shepherd' binary so
that running jobs can continue to run during the patch installation, you
must not kill the 'sge_shepherd' binary (process).

C. Installing the patch and restarting the software
---------------------------------------------------

Now install the patch with "patchadd" or by unpacking the 'tar.gz' files
included in this patch as outlined above.

Berkeley DB database update needed
----------------------------------

NOTE: This update is *not* needed if you already installed N1GE 6.0u3 or
higher. The update is only needed if you are upgrading from N1GE 6.0u2
or earlier.

After installing this patch, and before restarting your cluster, you
need to update your Berkeley DB (BDB) database in the following cases:

- you chose the BDB spooling option (not needed for classic spooling),
  either locally or with the BDB RPC option, and you are upgrading your
  cluster from N1 Grid Engine 6.0 or 6.0u1 to N1 Grid Engine 6.0u2 or
  higher

1. For safety reasons, please make a full backup of your existing
configuration. To perform a backup, use this command:

   % inst_sge -bup

2. Upgrade your BDB database.
This is done as follows:

   % inst_sge -updatedb

Restarting the software
-----------------------

Please log in to your qmaster machine and execution hosts and enter:

   # /etc/init.d/sgemaster
   # /etc/init.d/sgeexecd

After restarting the software, you may again enable your queues:

   # qmod -e '*'

If you renamed the shepherd binary, you may safely delete the old binary
when all jobs which were running prior to the patch installation have
finished.

New functionality delivered with N1GE 6.0 Update 8_1
----------------------------------------------------

1. New service for starting Windows GUI Applications
----------------------------------------------------

A new Windows Service "N1 Grid Engine Helper Service" was added to the
distribution. This new service allows Windows jobs to display a GUI on
the visible desktop of the execution host. The visible desktop is either
the desktop of the user currently logged in on the execution host or the
desktop of the next user who will log in. It's not the login screen.

The Helper Service is an independent component loosely coupled with the
execution daemon. The startup of the Helper Service is plugged in the
"Services" dialog in the Windows control panel. It's only possible to
install one Helper Service per host and it's only supported to have one
execution daemon installed per Helper Service.

To submit a job that requires a Windows GUI, set the job environment
variable "SGE_GUI_MODE" to "TRUE", "true" or "1", e.g.

   # qsub -v SGE_GUI_MODE=TRUE $SGE_ROOT/examples/jobs/sleeper.sh 60

The current release does not distinguish at scheduling time between GUI
and non-GUI jobs. Future releases of N1GE will address this feature with
a builtin complex variable "display_win_gui".
The definition will be:

#name            shortcut  type  relop  requestable  consumable  default  urgency
#----------------------------------------------------------------------------------------
display_win_gui  dwg       BOOL  ==     YES          NO          0        0

To be compatible with future releases of N1GE it's recommended to follow
this complex definition.

The execution daemon communicates with the Helper Service over a TCP/IP
connection using a dynamic port. A firewall might block local
connections and needs to be configured to allow local connections. The
Windows builtin firewall allows local connections by default.

The Helper Service always needs the password of the job user, no matter
if the job user is a local or a domain user. See sgepasswd(1) and
sgepasswd(5) for information on how to register the user's password.

Native Windows processes cannot be suspended. Therefore it is not
supported to suspend Windows GUI jobs. Though qmod -s/-us can be
executed, it has no effect on the running GUI process.

New functionality delivered with N1GE 6.0 Update 7
--------------------------------------------------

1. Reworked "qstat -xml" output
-------------------------------

The schema for "qstat -xml" and the "qstat -xml" output have been
reworked to ensure consistency between them and easy parsing of them via
JAXB. The most noticeable change is the date output, which now follows
the XML datetime format.

2. Reworked PE range matching algorithm in the scheduler
--------------------------------------------------------

The PE range matching algorithm is now adaptable and learns from past
decisions. This will lead to a much faster scheduling decision in case
of pe-ranges. This can be controlled by a new scheduling configuration
parameter: SELECT_PE_RANGE_ALG. It allows the old behavior to be
restored.

See sge_conf(5) for more information.

3. New monitoring feature in qmaster
------------------------------------

The monitoring provides detailed statistics on what the qmaster threads
are doing and how busy they are.
The statistics can be accessed via "qping -f" or from the qmaster
messages file. The feature is controlled by two qmaster configuration
parameters:

   MONITOR_TIME         specifies the time interval for the statistics
   LOG_MONITOR_MESSAGE  enables/disables the logging of the monitoring
                        messages into the qmaster messages file

See sge_conf(5) for more information.

4. New parameter for specialized job deletion
---------------------------------------------

A new "execd_param" (configured in the global cluster configuration):

   ENABLE_ADDGRP_KILL=true

can be configured to enable additional code within the execution host to
delete jobs. If this parameter is set, then the supplementary group id's
are used to identify all processes which are to be terminated when a job
should be deleted.

It only has an effect on the following architectures:

   sol* lx* osf4 tru64

See sge_conf(5) under "gid_range" for more information.

5. New reporting parameter to control accounting file flush time
----------------------------------------------------------------

A new reporting parameter, "accounting_flush_time", controls the flush
period for the accounting file. Previously, both the accounting and
reporting files were flushed at the same interval. Now they can be set
independently. Additionally, buffering of the accounting file can now be
disabled, allowing accounting data to be written to the accounting file
as soon as it becomes available.

See sge_conf(5) for more information.

New functionality delivered with N1GE 6.0 Update 6
--------------------------------------------------

1. Berkeley DB database tools are included in the distribution
--------------------------------------------------------------

All Berkeley DB database tools are now part of the N1 Grid Engine
distribution (not for the Microsoft Windows platform):

   db_archive
   db_checkpoint
   db_deadlock
   db_dump
   db_load
   db_printlog
   db_recover
   db_stat
   db_upgrade
   db_verify

The HTML documentation for these tools is part of the "common" patch and
can be found in:

   /doc/bdbdocs

New functionality delivered with N1GE 6.0 Update 4
--------------------------------------------------

1. New "qconf -purge" option
----------------------------

"qconf -purge" deletes all host or hostgroup settings from a cluster
queue. This facilitates the uninstallation of hosts or hostgroups.

See qconf(1) for a description of how to use this option.

2. Berkeley DB spooling on NFSv4 under Solaris 10 supported
-----------------------------------------------------------

The Berkeley DB database can now be installed on an NFSv4 mounted
filesystem on Solaris 10. For performance reasons it is recommended to
use NFSv4 BDB spooling only when the NFSv4 mount provides an excellent
high speed connection to the file server.

3. Execd installation in Solaris 10 zones supported
---------------------------------------------------

The execution daemon installation in Solaris 10 zones is supported. If
execution daemons are installed in the global zone and in local zones,
you need to ensure that the additional group id ranges (-> "gid_range"
in cluster configuration) of the global zone and the local zones do not
overlap. Local zones may use the same additional group id range on the
same host.

4. Faster execution daemon reconnect in CSP mode
------------------------------------------------

The Certificate Security Protocol (CSP) has been reworked and now is
fully integrated in the communication library layer. This allows a
faster reconnect of execution daemons after qmaster or execution daemon
restart.
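
For the "gid_range" requirement in item 3 above (Solaris 10 zones), the
overlap check can be sketched in plain shell. This is only an
illustration under stated assumptions: the function name
gid_ranges_overlap is hypothetical and not part of N1 Grid Engine, each
range is assumed to be a single "LOW-HIGH" pair, and the sample values
are made up. The actual values would have to be read from the cluster
configuration of the hosts involved (for example with "qconf -sconf").

```shell
#!/bin/sh
# Hypothetical helper (not shipped with N1 Grid Engine): succeeds if two
# gid_range values overlap.
# Assumption: each argument is a single range of the form "LOW-HIGH".
gid_ranges_overlap() {
    a_low=${1%-*};  a_high=${1#*-}
    b_low=${2%-*};  b_high=${2#*-}
    # Two closed intervals overlap when each starts before the other ends.
    [ "$a_low" -le "$b_high" ] && [ "$b_low" -le "$a_high" ]
}

# Illustrative values only: global zone range vs. local zone range.
if gid_ranges_overlap "20000-20100" "20050-20150"; then
    echo "overlap"
else
    echo "no overlap"
fi
```

With the sample values the script prints "overlap", which is the
situation to avoid between the global zone and a local zone.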

New functionality delivered with N1GE 6.0 Update 2
--------------------------------------------------

1. Avoid setting of LD_LIBRARY_PATH; inherited job environment
--------------------------------------------------------------

There are two new "execd_params" (defined in the global or local cluster
configuration) which control the environment inherited by a job:

   SET_LIB_PATH
   INHERIT_ENV

By default, SET_LIB_PATH is false and INHERIT_ENV is true.

If SET_LIB_PATH is true and INHERIT_ENV is true, each job will inherit
the environment of the shell that started the execd, with the N1GE lib
directory prepended to the lib path.

If SET_LIB_PATH is true and INHERIT_ENV is false, the environment of the
shell that started the execd will not be inherited by jobs, and the lib
path will contain only the N1GE lib directory.

If SET_LIB_PATH is false and INHERIT_ENV is true, each job will inherit
the environment of the shell that started the execd with no additional
changes to the lib path.

If SET_LIB_PATH is false and INHERIT_ENV is false, the environment of
the shell that started the execd will not be inherited by jobs, and the
lib path will be empty.

Environment variables which are normally overwritten by the shepherd,
such as PATH or LOGNAME, are unaffected by these new parameters.

2. DRMAA Java[TM] language binding delivered with this patch
------------------------------------------------------------

The DRMAA Java language binding is now available. The DRMAA Java
language binding library and documentation is contained in the patch for
the "common" package.

3. New qstat options to optimize memory overhead and speed of qstat
-------------------------------------------------------------------

The qstat client command has been enhanced to reduce the overall amount
of memory which is requested from the qmaster. To enable these changes
it is necessary to change the qstat default behavior. This is possible
by defining a cluster-global or user-specific sge_qstat file.
More information can be found in the sge_qstat(5) manual page. In
addition, two new qstat options ("-u" and "-s") have been introduced to
be used with the sge_qstat default file. Find more information in
qstat(1).

4. Tuning parameter for sharetree spooling
------------------------------------------

A new "qmaster_param" (configured in the global cluster configuration):

   STREE_SPOOL_INTERVAL=