Patch-ID# 118089-04
Keywords: qmaster scheduler qmon qstat qconf memory usage
Synopsis: N1 Grid Engine 6.0: maintenance patch
Date: Apr/28/2005

*****************************************************
Patch 118089-04 has been re-instated on July 29, 2005.
Patch 118089-05 was withdrawn.

Reason for the withdrawal of 118089-05:

Bug ID 6295165
    finished array job tasks can be rescheduled if master/scheduler
    daemons are stopped/started
introduced priority 2 defect 6298233
    no user notification or command hanging if an immediate job
    cannot be scheduled

This may result in wrong scheduling behavior and/or follow-up errors
if command output is parsed by script wrappers used in integrations.

Please reference bug ID 6298233 for more information.

Recommendation: Back out patch 118089-05 if interactive jobs are used.
*****************************************************

Install Requirements: Additional instructions may be listed below

Solaris Release:

SunOS Release:

Unbundled Product: N1 Grid Engine

Unbundled Release: 6.0

Xref: NOTE: See patch matrix below

Topic:

Relevant Architectures: mac os/x

BugId's fixed with this patch:
5063311 5063316 5063987 5071498 5071502 5071522 5071525 5071539
5071914 5071918 5071987 5072005 5072481 5072772 5073218 5074788
5075346 5075398 5075451 5075849 5075936 5075968 5076358 5076372
5076491 5077165 5077549 5077589 5078783 5079514 5079572 5080779
5080784 5080833 5080836 5080839 5080840 5080851 5080852 5080853
5080856 5081821 5081822 5081839 5082490 5083102 5083115 5084317
5085004 5085010 5085392 5086108 5089222 5089255 5090162 5092487
5094016 5095907 5097732 5102320 5102442 5104270 5104789 5108635
5108639 6174301 6174326 6174331 6174821 6174915 6176115 6176177
6176181 6178843 6180529 6183365 6184460 6184466 6185136 6185169
6185208 6185211 6186597 6189286 6189289 6190164 6191366 6193348
6193361 6193866 6194002 6194625 6194713 6194719 6194729 6195249
6196578 6199256 6199261 6200013 6201033 6201038 6201039 6201040
6201042 6205060 6205648 6211243 6211309
6215580 6215730 6216020 6218430 6218877 6219999 6220060 6221167
6221231 6221244 6222237 6222811 6222861 6222930 6225570 6226085
6228350 6228786 6229253 6229277 6229373 6230846 6231366 6232074
6233162 6233173 6234371 6234836 6236261 6236469 6236472 6236475
6239394 6239569 6239660 6240739 6241376 6241378 6241401 6241430
6241487 6241544 6242055 6242057 6242165 6242172 6242181 6242779
6244215 6244229 6244808 6244865 6245487 6247211 6247238 6247239
6247889 6251172 6251943 6252465 6252524 6253093 6253266 6255329
6255336 6255804 6255850 6255902 6256457 6256530 6259380 6260024
6260729

Changes incorporated in this version:
5085004 6178843 6186597 6194719 6199256 6205060 6215730 6218430
6218877 6219999 6220060 6221167 6221231 6221244 6222237 6222811
6222861 6222930 6225570 6226085 6228350 6228786 6229253 6229277
6229373 6230846 6231366 6232074 6233162 6233173 6234371 6234836
6236261 6236469 6236472 6236475 6239394 6239569 6239660 6240739
6241376 6241378 6241401 6241430 6241487 6241544 6242055 6242057
6242165 6242172 6242181 6242779 6244215 6244229 6244808 6244865
6245487 6247211 6247238 6247239 6247889 6251172 6251943 6252465
6252524 6253093 6253266 6255329 6255336 6255804 6255850 6255902
6256457 6256530 6259380 6260024 6260729

Patches accumulated and obsoleted by this patch:

Patches which conflict with this patch:

Patches required with this patch:

Obsoleted by:

Files included with this patch:

/bin/darwin/qacct
/bin/darwin/qalter
/bin/darwin/qconf
/bin/darwin/qdel
/bin/darwin/qhost
/bin/darwin/qmake
/bin/darwin/qmod
/bin/darwin/qmon
/bin/darwin/qping
/bin/darwin/qsh
/bin/darwin/qstat
/bin/darwin/qsub
/bin/darwin/qtcsh
/bin/darwin/sge_coshepherd
/bin/darwin/sge_execd
/bin/darwin/sge_qmaster
/bin/darwin/sge_schedd
/bin/darwin/sge_shadowd
/bin/darwin/sge_shepherd
/lib/darwin/libXltree.dylib
/lib/darwin/libcrypto.0.9.7.dylib
/lib/darwin/libdrmaa.dylib
/lib/darwin/libdrmaa.jnilib
/lib/darwin/libspoolb.dylib
/lib/darwin/libspoolc.dylib
/lib/darwin/libssl.0.9.7.dylib
/lib/darwin/libssl.bundle
/utilbin/darwin/adminrun
/utilbin/darwin/berkeley_db_svc
/utilbin/darwin/checkprog
/utilbin/darwin/checkuser
/utilbin/darwin/db_archive
/utilbin/darwin/db_checkpoint
/utilbin/darwin/db_dump
/utilbin/darwin/db_load
/utilbin/darwin/filestat
/utilbin/darwin/fstype
/utilbin/darwin/gethostbyaddr
/utilbin/darwin/gethostbyname
/utilbin/darwin/gethostname
/utilbin/darwin/getservbyname
/utilbin/darwin/infotext
/utilbin/darwin/loadcheck
/utilbin/darwin/now
/utilbin/darwin/openssl
/utilbin/darwin/qrsh_starter
/utilbin/darwin/rlogin
/utilbin/darwin/rsh
/utilbin/darwin/rshd
/utilbin/darwin/sge_share_mon
/utilbin/darwin/spooldefaults
/utilbin/darwin/spooledit
/utilbin/darwin/spoolinit
/utilbin/darwin/testsuidroot
/utilbin/darwin/uidgid

Problem Description:

6260729 Can't select 'slots' in select box when adding consumables for execution host
6260024 qmon cluster queue modify cancel not working correct
6259380 potential qmaster sec. fault.
6256530 cqueues/all.q trashed after qmaster shutdown with 1362 hosts
6256457 pe jobs disappear in t state (execd doesn't know this job)
6255902 qmake in dynamic allocation mode core dump
6255850 the usage in projects is never spooled while the qmaster
6255804 job in error state breaks qstat -f -xml
6255336 execd does sends empty job report for a pe slave task
6255329 qmaster does not store sharetree usage on shutdown
6253266 failed array tasks are rescheduled only one by one
6253093 qstat -f -pe make breaks
6252524 Missing success message with qconf -Aprj
6252465 qsub option parameter string only supports 2048 character strings
6251943 japi does not work with host aliasing
6251172 reserved jobs prevent other jobs from starting
6247889 qsub -sync y return code behaviour broken
6247239 sequence nr of execd load reports corrupted
6247238 qsub fails to work correctly with -b n -cwd
6247211 qstat -explain E does not print queue errors correctly
6245487 qhost -h does not show selected host
6244865 a series of matching soft
queue requests gets not counted separately
6244808 scheduler does not get all objects on a qmaster or scheduler startup
6244229 misleading qstat -j message when the scheduler is not running
6244215 qsub -b y must fail if no command is specified
6242779 qsub -now yes not working on CSP system
6242181 Failed drmaa_control (DRMAA_CONTROL_TERMINATE) causes deadlock
6242172 Multi-threaded args parsing problems
6242165 Profiling library never frees thread slots
6242057 jobs which request consumable resources which are set to infinity are not scheduled
6242055 Consumable request may not be 0 if PE requested
6241544 qstat -F dies in case of a infinit integer setting
6241487 termination script may not be ignored, when job submited with -notify
6241430 error message "no execd known on host"
6241401 Conflicting requirements should have the same meaning with qstat and qsub
6241378 Reservation of wrong hosts
6241376 qstat -U aborts
6240739 qstat -s hu shows pending jobs only
6239660 qmaster profiling doesn't start at qmaster startup
6239569 qmaster does not accept new connections if number of execd's exceed FD_SETSIZE
6239394 Spooledit fails during database upgrade
6236475 DRMAA segfaults with > 255 threads
6236472 qsub -sync y doesn't remove session directories
6236469 JAPI: Can be made to start two event client threads
6236261 BDB install on NFSv4 share
6234836 Need a means to purge host or hostgroup specific cluster queue
6234371 error message from execd about endpoint is not unique
6233173 qloadsensor dies sporadically
6233162 global scheduler messages are reported multiple times
6232074 load formula is not working for pe jobs
6231366 deadlock in the qmaster due to qconf -k[s|e]
6230846 execd logs error mesage, when a tight pe job in "t" state is deleted
6229373 An array pe job can set queues into error state
6229277 qselect uses sge_qstat file
6229253 a parallel array job can kill the qmaster
6228786 Long delay when starting up large pe jobs
6228350 Execd messages file
contains incorrectly-formatted lines
6226085 suspend_interval is ignored when enabling jobs due to suspend_thresholds change
6225570 sharetree has a usage leak
6222930 After shadowd takes over there is a long delay before execd connects to new qmaster
6222861 error message "no execd known on host"
6222811 scheduler can get out of sync
6222237 huge CPU and memory overhead when modifiying complex attributes
6221244 releasing user hold state through qrls may not require manager priviledges
6221231 qsub -sync y return code behaviour broken
6221167 sge_schedd segfaults in case of a restart and a running pe job.
6220060 wrong calendar settings kills the qmaster
6219999 changing of local execd_spool_dir is fault prone
6218877 qstat -t is broken
6218430 Problems with load values if execution daemons run in a solaris zone at x86
6215730 qdel failed to delete qrsh (login) job on a Solaris box when Secure Shell is used
6205060 SGE tools segfault when gid can't be looked up
6199256 qconf -[a|A|m|M]stree kills qmaster
6194719 starter_method is ignored with binary jobs that are started without a shell
6186597 qconf error diagnosis broken
6178843 qconf changes to complex doesn't display all the changes made upon exit
5085004 qstat -f -q all.q@HOSTNAME does not resolve hostname

(from 118089-03)

6205648 error in commlib read/write timeout handling
6211243 The qstat -ext -xml command is broken with N1GE6 Update 2 patch
6211309 qmaster running out of file descriptors
6215580 execd messages file contains errors for tight integrated jobs
6216020 pending job task deletion may not work

(from 118089-02)

5075968 Thread enabled commlib coredumps on exit on a 32bit Solaris x86 box
5085010 qmon customize filter for running jobs does not filter
5086108 wrong message appears when queue instance becomes error state
5089222 scheduling weirdness with wild-card PE's
5089255 Submit to a queue domain is never scheduled
5090162 qmake does not export shell env.
vars
5092487 hard resource requests ignored in parallel jobs
5094016 o-tickets assigned to departments are ignored
5095907 qacct -l is not working
5097732 Need detailed error messages from communication layer
5102320 memory leak in the scheduler, with pe jobs and resource requests
5102442 qconf -de crashes qmaster
5104270 Cannot add calendar with \ syntax
5104789 mail sent by qmaster leaves zombie processes
5108635 $ARCH required in path for qloadsensor and qidle.
5108639 qconf -sstree seg faults with large share trees
6174301 N1GE6: qsub -js and negative job_share numbers acts strangely/unexpectedly.
6174326 qconf -sq displayes "slots" in the complex_values line
6174331 Option "-v VAR" does not fetch from envrionment
6174821 segmentation fault when vmemsize limit is reached
6174915 qconf has wrong exit status
6176115 Show qmaster/execd application status in qping
6176177 restoring a backup does not restore the job_scripts dir.
6176181 qdel "" kills qmaster
6180529 meaningless job error state diagnosis text in qstat -j
6183365 qconf -sstree gives a SIGBUS error
6184460 qmod -[d|e] cannot handle the folowing qnames: "[0-9]*"
6184466 scheduler does not look ahead to consider queue calendars state transitions
6185136 Job customize shows weird characters for fields, additional fields cannot be added
6185169 qmon returns an error dialog, when editing a calendar
6185208 qmon and equal job arguments
6185211 Job environments should not include Grid Engine dynamic library path
6189286 memory leak in the scheduler with consumables as load thresholds
6189289 a cluster queue can be deleted, even though it is referenced in an other cq
6190164 too many array tasks are deleted
6191366 tightly integrated pe jobs: scheduler doesn't respect usage of pe tasks in sharetree calculation
6193348 qconf -mq does not output the subordinate_list correct
6193361 Jobs fail in case of NFS execd installation on volumes exported without root write priviledges
6193866 backup/restore does not work
under Linux and others..
6194002 sgemaster -migrate on qmaster host tries to start second qmaster
6194625 subordinate queues consume excessive memory
6194713 Only first subordinate queue will be suspended at qmaster restart
6194729 Subordinate queue thresholds are not spooled with BDB
6195249 QMON Cluster Queue Window: Heading line words does not match into column width
6196578 backup failes, when...
6199261 a sharetree delete can kill qmon
6200013 arch script does not know about /lib64
6201033 qmaster might fail if jobs are deleted which have multiple hold states applied
6201038 reduce the impact of qstat on the overall performance
6201039 qconf -ks gives bad error message if scheduler isn't running
6201040 Exit 99 jobs are not rescheduled to hosts where they ran before
6201042 qdel "*" produces error logging in qmaster messages file

(from 118089-01)

5063311 high memory usage of schedd and qmaster (schedd_job_info)
5063316 PE job submit error, when qmaster is busy
5063987 qmaster cannot bind port below 1024 on Linux
5071498 projects not available after sge_qmaster restart
5071502 calendars broken
5071522 Startup of qmaster changes act_qmaster to `hostname`
5071525 qalter abort
5071539 qping doesn't support host_aliases file
5071914 scheduler ignores queue seqno for queue sorting
5071918 qmod -e '@' causes segmentation fault in qmaster
5071987 Qmaster requires a local conf in order to start.
5072005 drmaa_run_job() may change the current directory
5072481 Deleted pending job appears in qstat
5072772 sge_qmaster constantly rewrites spool files of tightly integrated parallel jobs
5073218 qconf -aq @ crashes qmaster
5074788 jobs on hold due to -a time cause qmaster/schedd get out of sync
5075346 Sharetree doesn't work correct
5075398 variable syntax : equal sign support
5075451 sched_conf(5) reprioritize_interval should default to 0
5075849 a registering event client can get events before it got its total update
5075936 qmon's queue filtering doesn't work
5076358 It shuld be used "." and "$" with qsub -N
5076372 "|" should be able to be used with qsub -N
5076491 qmaster clients may not reconnect after qmaster outage
5077165 reprioritize_interval descr in sched_conf(5) needs improvemen
5077549 qsub -N "@" causes qmaster down
5077589 schedd and qmaster get out of sync - no scheduling for long time
5078783 Wallclock time limit in qmon
5079514 execd shutdown with sgeexecd fails when host aliases are used
5079572 Resending queue signals broken
5080779 qconf -de host does not update the host groups
5080784 qselect crash
5080833 qconf -mattr dumps core if used incorrectly
5080836 qhosts outputs NCPU as float
5080839 qconf -mq displayes "slots" in the complex_values line
5080840 problems when qconf -mattr is used in conjunction with host_aliases file
5080851 qalter/qdel/qmod abort
5080852 qconf -aq @ crashes qmaster
5080853 DRMAA doesn't reject jobs that never will be dispatchable
5080856 QCONF: qconf -mc segfaults
5081821 qstat XML output typo
5081822 Deleting a queue instance slots value actually adds it
5081839 qconf -ahgrp fails if no hgrp name is specified
5082490 qstat -ext -urg omits time info
5083102 hostgroup changes do not always take effect.
5083115 Need more verbose diagnosis msg if execd port is already bound
5084317 Invalid job_id's in reporting file (only l24_amd64)
5085392 qstat -j -xml generates no parseble xml output

Patch Installation Instructions:
--------------------------------

tar.gz Patch Installation:
--------------------------

See the patch installation instructions below before installing this
patch!

Patches in 'tar.gz' format cannot be installed with 'patchadd' on
Solaris systems.

The patch is installed by unpacking the 'tar.gz' file(s) in this
directory in . is usually your directory.

Once installed, this patch is not visible with the "showrev -p"
command on Solaris.

This patch cannot be backed out. You may want to make a backup copy of
the files before installing this patch, since the files will be
overwritten.

Please read "Install Instructions" later in this file and carry out
all steps before you unpack the 'tar.gz' file(s) included in this
patch.

This patch in 'tar.gz' format should not be installed if the original
package has been installed with 'pkgadd' on Solaris. If the original
installation used the 'pkgadd' utility, install the available patches
for N1 Grid Engine 6 packages; refer to the patch matrix below.

The patch is installed by user root by unpacking the file(s) in the
directory where the original package has been installed:

   # cd
   # gzip -dc / | tar xvpf -

After installing the patch, you should correct the file permissions if
your Sun Grid Engine installation is installed as an "admin user"
system:

   # cd
   # util/setfileperm.sh

Patch requirements and patch matrix for N1 Grid Engine 6 packages
-----------------------------------------------------------------

The patches below update an N1 Grid Engine 6 distribution to N1 Grid
Engine 6 Update 4 (N1GE 6.0u4). The "-help" output of most commands
will print a version string "N1GE 6.0u4" after applying the patch.
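The version string mentioned above can be used for a quick check of the patch level on a host. The following is a minimal sketch, not part of the patch: the helper name has_patch_level is hypothetical, and it assumes only that the "-help" output contains the version string somewhere in its text.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with N1 Grid Engine): reads help text
# on stdin and succeeds if it contains the expected version string.
# The default string is the one quoted in this README.
has_patch_level() {
    expected="${1:-N1GE 6.0u4}"
    grep -F -q -- "$expected"
}

# Example use on a patched host. This assumes that qstat is in PATH and
# that its "-help" output carries the version string:
#   qstat -help 2>&1 | has_patch_level && echo "patch level: N1GE 6.0u4"
```

Since the README says only that "most" commands print the string, a failed check on a single command is not by itself proof that the patch is missing.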

All packages of an N1 Grid Engine 6 distribution must have the same
patch level (exception for ARCo - see below under "Installation note
for ARCo patches"). Please refer to the patch matrix below, which
updates the distribution to the most recent patch level. It is neither
supported nor possible to mix different patch levels of binaries and
the "common" package in a single N1 Grid Engine cluster.

1. Patches for packages in Sun pkgadd format
--------------------------------------------

Package name*  OS*                     Architecture*  Patch-Id
-----------------------------------------------------------------
SUNWsgee       Solaris, Sparc, 32bit   sol-sparc      118094-04
SUNWsgeex      Solaris, Sparc, 64bit   sol-sparc64    118130-04
SUNWsgeex      Solaris x86             sol-x86        118131-04
SUNWsgeec      all                     common         118132-04
SUNWsgeea      all                     arco           118133-04

*Package Name = see pkginfo(1)
*OS           = Operating system
*Architecture = N1 Grid Engine binary architecture string or
                "common" = architecture independent packages
                "arco"   = Accounting and Reporting console

2. Patches for packages in tar.gz format
----------------------------------------

OS*                        Architecture   Patch-Id
-----------------------------------------------------
Solaris, Sparc, 32bit      sol-sparc      118082-04
Solaris, Sparc, 64bit      sol-sparc64    118083-04
Solaris, x86               sol-x86        118084-04
Linux kernel2.4/2.6, x86   lx24-x86       118085-04
Linux kernel2.4/2.6, AMD64 lx24-amd64     118086-04
IBM AIX 4.3                aix43          118087-04
IBM AIX 5.1                aix51          118088-04
Apple MAC OS/X             darwin         118089-04
HP HP-UX 11                hp11           118090-04
SGI Irix 6.5               irix65         118091-04
all                        common         118092-04
all                        arco           118093-04

Installation note for ARCo patches 118133 or 118093
---------------------------------------------------

With N1 Grid Engine 6 Update 4 (or higher) the look&feel of ARCo has
been updated to use the native Sun Web Console (SWC) 2.2 GUI controls.
The installation of the ARCo patch requires a new installation of SWC
2.2. You may not install the ARCo patch unless you can install SWC 2.2.
SWC 2.2 is bundled with the N1 Grid Engine 6 Update 4 CDROM or can be
downloaded with the full N1GE 6.0u4 distribution from Sun Download
Center (SDLC). If you have a valid support contract, please contact
your Sun Microsystems account manager to find out how to get access to
N1GE 6.0u4 for no additional license fees. The distribution also
contains a new version of the manuals which describe the installation
and use of ARCo.

You can install all other N1GE6 patches without updating ARCo to the
most recent patch level. However, you cannot make a new installation
of older ARCo packages of N1GE 6.0/6.0u1/6.0u2/6.0u3 once you have
installed the N1GE 6.0u4 binary and "common" patches.

Special Install Instructions:
-----------------------------

NOTE: This patch requires that you update your Berkeley DB database
      files if you are upgrading from N1GE 6.0u1 or 6.0. Please read
      the full notes when applying this patch.

These installation instructions assume that you are running a
homogeneous N1 Grid Engine cluster (called "the software") where all
hosts share the same directory for the binaries. If you are running
the software in a heterogeneous environment (mix of different binary
architectures), you need to apply the patch installation for all
binary architectures as well as the "common" and "arco" packages. See
the patch matrix above for details about the available patches.

If you installed the software on local filesystems, you need to
install all relevant patches on all hosts where you installed the
software locally.

By default, there should be no running jobs when the patch is
installed. There may be pending batch jobs, but no pending interactive
jobs (qrsh, qmake, qsh, qtcsh).

It is possible to install the patch with running batch jobs. To avoid
a failure of the active 'sge_shepherd' binary, it is necessary to move
the old shepherd binary (and copy it back prior to the installation of
the patch).

You cannot install the patch with running interactive jobs, 'qmake'
jobs or with running parallel jobs which use the tight integration
support (control_slaves=true is set in the PE configuration).

Stopping the N1 Grid Engine cluster from starting jobs
------------------------------------------------------

Disable all queues so that no new jobs are started:

   # qmod -d '*'

Optional (only needed if there are running jobs which should continue
to run when the patch is installed):

   # cd $SGE_ROOT/bin
   # mv /sge_shepherd /sge_shepherd.sge60

It is important that the binary is moved with the "mv" command. It
should not be copied, because this could crash an active shepherd
process which is running a job when the patch is installed.

Shutting down the N1 Grid Engine qmaster, scheduler and execution daemons
-------------------------------------------------------------------------

You need to shut down (and restart) the qmaster and scheduler daemon
and all running execution daemons.

Shut down the execution daemons first. Log in to all your execution
hosts and stop the execution daemons:

   # /etc/init.d/sgeexecd softstop

Then log in to your qmaster machine and stop qmaster and scheduler:

   # /etc/init.d/sgemaster stop

Now verify with the 'ps' command that all N1 Grid Engine daemons on
all hosts are stopped. If you decided to rename the 'sge_shepherd'
binary so that running jobs can continue to run during the patch
installation, you must not kill the 'sge_shepherd' binary (process).

Installing the patch and restarting the software
------------------------------------------------

Now install the patch with "patchadd" or by unpacking the 'tar.gz'
files included in this patch as outlined above.

Berkeley DB database update needed
----------------------------------

NOTE: This update is not needed if you already installed N1GE 6.0u3 or
      higher. The update is only needed if you are upgrading from N1GE
      6.0u1 or earlier.

After installing this patch, and before restarting your cluster, you
need to update your Berkeley DB (BDB) database in the following cases:

   - you chose the BDB spooling option (not needed for classic
     spooling), either locally or with the BDB RPC option, and you are
     upgrading your cluster from N1 Grid Engine 6.0 or 6.0u1 to N1
     Grid Engine 6.0u2 or higher

1. For safety reasons, please make a full backup of your existing
   configuration. To perform a backup use this command:

      % inst_sge -bup

2. Upgrade your BDB database. This is done as follows:

      % inst_sge -updatedb

Restarting the software
-----------------------

Please log in to your qmaster machine and execution hosts and enter:

   # /etc/init.d/sgemaster
   # /etc/init.d/sgeexecd

After restarting the software, you may again enable your queues:

   # qmod -e '*'

If you renamed the shepherd binary, you may safely delete the old
binary when all jobs which were running prior to the patch
installation have finished.

New functionality delivered since N1GE 6.0
-------------------------------------------

1. Avoid setting of LD_LIBRARY_PATH; inherited job environment
--------------------------------------------------------------

There are two new "execd_params" (defined in the global or local
cluster configuration) which control the environment inherited by a
job:

   SET_LIB_PATH
   INHERIT_ENV

By default, SET_LIB_PATH is false and INHERIT_ENV is true.

If SET_LIB_PATH is true and INHERIT_ENV is true, each job will inherit
the environment of the shell that started the execd, with the N1GE lib
directory prepended to the lib path.

If SET_LIB_PATH is true and INHERIT_ENV is false, the environment of
the shell that started the execd will not be inherited by jobs, and
the lib path will contain only the N1GE lib directory.

If SET_LIB_PATH is false and INHERIT_ENV is true, each job will
inherit the environment of the shell that started the execd with no
additional changes to the lib path.

If SET_LIB_PATH is false and INHERIT_ENV is false, the environment of
the shell that started the execd will not be inherited by jobs, and
the lib path will be empty.

Environment variables which are normally overwritten by the shepherd,
such as PATH or LOGNAME, are unaffected by these new parameters.

2. DRMAA Java[TM] language binding delivered with this patch
------------------------------------------------------------

The DRMAA Java language binding is now available. The DRMAA Java
language binding library and documentation is contained in the patch
for the "common" package.

3. New qstat options to optimize memory overhead and speed of qstat
-------------------------------------------------------------------

The qstat client command has been enhanced to reduce the overall
amount of memory which is requested from the qmaster. To enable these
changes it is necessary to change the qstat default behavior. This is
possible by defining a cluster-global or user-specific sge_qstat file.
More information can be found in sge_qstat(5) manual page.

In addition two new qstat options ("-u" and "-s") have been introduced
to be used with the sge_qstat default file. Find more information in
qstat(1).

4. Tuning parameter for sharetree spooling
------------------------------------------

A new "qmaster_param" (configured in the global cluster
configuration):

   STREE_SPOOL_INTERVAL=