Patch-ID# 118089-07 NOTE: *********************************************************************** READ THE TERMS OF THE AGREEMENT ("AGREEMENT") IN THE LEGAL_LICENSE.TXT FILE CAREFULLY BEFORE USING THIS SOFTWARE. BY USING THE SOFTWARE, YOU AGREE TO THE TERMS OF THIS AGREEMENT. IF YOU DO NOT AGREE TO ALL OF THE TERMS, PROMPTLY DESTROY THE UNUSED SOFTWARE. *********************************************************************** Keywords: qmaster scheduler qmon qstat qconf qconf memory usage Synopsis: N1 Grid Engine 6.0: maintenance patch Date: Dec/06/2005 Install Requirements: Additional instructions may be listed below Solaris Release: SunOS Release: Unbundled Product: N1 Grid Engine Unbundled Release: 6.0 Xref: See patch matrix below Topic: Relevant Architectures: mac os/x BugId's fixed with this patch: 4769608 5063311 5063316 5063987 5071498 5071502 5071522 5071525 5071539 5071914 5071918 5071987 5072005 5072481 5072772 5073218 5074788 5075346 5075398 5075451 5075849 5075936 5075968 5076358 5076372 5076491 5077165 5077549 5077589 5078783 5079514 5079572 5080779 5080784 5080833 5080836 5080839 5080840 5080851 5080852 5080853 5080856 5081821 5081822 5081839 5082490 5083102 5083115 5084317 5085004 5085010 5085392 5086108 5089222 5089255 5090162 5092487 5094016 5095907 5097732 5102320 5102442 5104270 5104789 5108635 5108639 6174301 6174326 6174331 6174821 6174915 6176115 6176177 6176181 6178843 6180529 6183365 6184460 6184466 6185136 6185169 6185208 6185211 6186597 6189286 6189289 6190164 6191366 6193348 6193361 6193866 6194002 6194625 6194713 6194719 6194729 6195249 6196578 6199256 6199261 6200013 6201033 6201038 6201039 6201040 6201042 6205060 6205648 6207868 6211243 6211309 6215580 6215730 6216020 6218430 6218877 6219999 6220060 6221167 6221231 6221244 6222237 6222811 6222861 6222930 6225570 6226085 6228350 6228786 6229253 6229277 6229373 6230846 6231366 6232074 6233162 6234371 6234836 6236261 6236469 6236472 6236475 6239394 6239470 6239569 6239660 6240739 6241376 
6241378 6241401 6241430 6241487 6241544 6242055 6242057 6242165 6242169 6242172 6242181 6242779 6244215 6244229 6244808 6244865 6245487 6247211 6247238 6247239 6247889 6250603 6250692 6251172 6251943 6252465 6252469 6252524 6252525 6253093 6253266 6253860 6255111 6255329 6255336 6255804 6255850 6255902 6256457 6256530 6256590 6259380 6260024 6260656 6260729 6264592 6265154 6266392 6266450 6267238 6267245 6267932 6268707 6268799 6269305 6269411 6273006 6273217 6274467 6275789 6277909 6278147 6278727 6279402 6279409 6279523 6280698 6281440 6281462 6282996 6283308 6285898 6286510 6286533 6287847 6287850 6287862 6287865 6287867 6287935 6287946 6287953 6287955 6287958 6288156 6288588 6288626 6289455 6291016 6291023 6292742 6292751 6292926 6293411 6294052 6294397 6294875 6295165 6295791 6298056 6298233 6299345 6299351 6299939 6299982 6301047 6303671 6304466 6304471 6304490 6305095 6306229 6306834 6307557 6313445 6314019 6315111 6316995 6317028 6317048 6318018 6318660 6319228 6319231 6319233 6320683 6320869 6322498 6327427 6328703 6329832 6332876 6332877 6333407 6333467 6336519 6338314 6339756 6342005 6346696 6346704 6348299 6348478 6348516 6348517 6349818 6349972 6351728 6353526 6353638 6354143 6354164 6355263 Changes incorporated in this version: 6207868 6242169 6250692 6253860 6255111 6256590 6268799 6275789 6279523 6282996 6286510 6287847 6287850 6287862 6287865 6287935 6287946 6287953 6287955 6288626 6289455 6291016 6291023 6292742 6292751 6292926 6293411 6294052 6294875 6295791 6299982 6301047 6303671 6304466 6304471 6304490 6305095 6306229 6306834 6307557 6313445 6314019 6315111 6316995 6317028 6317048 6318018 6318660 6319228 6319231 6319233 6320683 6320869 6322498 6327427 6328703 6329832 6332876 6332877 6333407 6333467 6336519 6338314 6339756 6342005 6346696 6346704 6348299 6348478 6348516 6348517 6349818 6349972 6351728 6353526 6353638 6354143 6354164 6355263 Patches accumulated and obsoleted by this patch: Patches which conflict with this patch: Patches required 
with this patch: Obsoleted by: Files included with this patch: /bin/darwin/qacct /bin/darwin/qalter /bin/darwin/qconf /bin/darwin/qdel /bin/darwin/qhost /bin/darwin/qmake /bin/darwin/qmod /bin/darwin/qmon /bin/darwin/qping /bin/darwin/qsh /bin/darwin/qstat /bin/darwin/qsub /bin/darwin/qtcsh /bin/darwin/sge_coshepherd /bin/darwin/sge_execd /bin/darwin/sge_qmaster /bin/darwin/sge_schedd /bin/darwin/sge_shadowd /bin/darwin/sge_shepherd /bin/darwin/sgepasswd /examples/jobsbin/darwin/work /lib/darwin/libXltree.dylib /lib/darwin/libcrypto.0.9.7.dylib /lib/darwin/libdrmaa.dylib /lib/darwin/libdrmaa.jnilib /lib/darwin/libspoolb.dylib /lib/darwin/libspoolc.dylib /lib/darwin/libssl.0.9.7.dylib /lib/darwin/libssl.bundle /utilbin/darwin/adminrun /utilbin/darwin/berkeley_db_svc /utilbin/darwin/checkprog /utilbin/darwin/checkuser /utilbin/darwin/db_archive /utilbin/darwin/db_checkpoint /utilbin/darwin/db_deadlock /utilbin/darwin/db_dump /utilbin/darwin/db_load /utilbin/darwin/db_printlog /utilbin/darwin/db_recover /utilbin/darwin/db_stat /utilbin/darwin/db_upgrade /utilbin/darwin/db_verify /utilbin/darwin/filestat /utilbin/darwin/fstype /utilbin/darwin/gethostbyaddr /utilbin/darwin/gethostbyname /utilbin/darwin/gethostname /utilbin/darwin/getservbyname /utilbin/darwin/infotext /utilbin/darwin/loadcheck /utilbin/darwin/now /utilbin/darwin/openssl /utilbin/darwin/qrsh_starter /utilbin/darwin/rlogin /utilbin/darwin/rsh /utilbin/darwin/rshd /utilbin/darwin/sge_share_mon /utilbin/darwin/spooldefaults /utilbin/darwin/spooledit /utilbin/darwin/spoolinit /utilbin/darwin/testsuidroot /utilbin/darwin/uidgid Problem Description: 6355263 reschedule of a parallel job crashes the qmaster 6354164 drmaa does not work on hp11 platform 6354143 mutually subordinating queues suspend each other simultaneously 6353526 reprioritize field in qmon cluster config missing 6351728 installation of qmaster failed when using /etc/services 6349972 DRMAA crashes during some operations on bulk jobs 6349818 an 
additional started schedd/execd daemons may not stop if started when qmaster is down 6348517 job finish although terminate method is still running 6348516 job finish does not terminate all processes of a job 6348299 qconf -mstree aborts 6346704 qrsh -V doesn't always work. 6346696 connection to Berkeley DB RPC server can timeout 6342005 a scheduler configuration change with a sharetree can result in a usage leak 6339756 Quotes in qtask file can result in memory corruption 6338314 occasional "failed to deliver job" errors due to SIGPIPE in sge_execd 6336519 changing the cwd flag in qmon - qalter has no effect 6333467 sgemaster -migrate may not delete qmaster lock file and may break shadowd functionality 6333407 configuring the halflife_decay_list crashes the qmaster 6332877 qstat -pe filter does not work 6332876 qstat -U does not consider queue access for job and project access for queues 6329832 qconf and qmaster accept invalid settings for queue complex_values 6328703 fstype does not recognize nfs4 share in all cases 6327427 qping core dump with enabled message content dump 6322498 calendar syntax "week mon=0-21" corrupts SGE and may crash qmaster 6320869 sge_qmaster daemon is running on both the master and shadow nodes after a long network failure 6320683 Binary switch reversed in job category and can cause application to hang 6319233 Parsing of context variable options fails for values containing commas in single quotes 6319231 unable to delete a configuration of a non existing host 6319228 Backslash line continuation is broken for host groups 6318660 the system hold on an array task can vanish 6318018 shepherd doesn't handle qrlogin/qrsh jobs correctly 6317048 Memory leaks in drmaa library, japi_wait and drmaa_job2sge_job 6317028 Quotes in job category can result in memory corruption 6316995 qconf -mp prints error messages two times 6315111 doing a qalter -l rsc=val on running jobs breaks consumable debit 6313445 Qrsh tries to free invalid pointer 6307557 qhost 
returns wrong total_memory value on MacOSX 10.3 6306834 consumables as thresholds are not working correctly with pe jobs 6306229 wrong soft requests decision 6305095 qstat schema files are incomplete 6304490 qconf -as/-ah leads to segmentation fault 6304471 qlogin -R does not work like documented 6304466 qmaster crashes with large number of qconf -aattr calls 6303671 DRMAA can abort in the middle of a session if NIS becomes unavailable 6301047 qstat -s p doesn't show pending array tasks while there are tasks of this job running 6299982 Slow submission rate with drmaa_run_job() 6295791 qacct -h should not resolve hostnames 6294875 CSP: consolidate error output if cert CA on client and server don't match 6294052 suspend threshold is not working for calendar disabled queues 6293411 NFS write error on host : Permission denied. 6292926 qconf -mattr can crash qmaster 6292751 admin mail information is incorrect 6292742 tight integration - qrsh_exit_code file not written 6291023 qstat -j doesn't print delimiter between jobs 6291016 qmon startup and queue add/modify warning messages 6289455 qstat -XML output does not match the schema 6288626 default PATH variable set for job insufficient for non-login shell jobs 6287955 strange reservation 6287946 qconf -[dm]attr gets confused by shortcuts 6287935 qmod -sq can kill a pe job in t state 6287865 qrsh default job names are not consistent with documented job name limitations 6287862 qhost -l for complexes is broken 6287850 Allow SIGTRAP to enable debugging 6287847 qstat -j shows wrong message for parallel jobs which can't be dispatched 6286510 delivery of queue based signals to execd repeated endlessly 6282996 use of IP address as host name disables unique hostname resolving 6275789 soft requirements on load values are ignored 6268799 confusing execd startup messages and delays in case of problems 6256590 qconf -mq disallows 2057 hostspecific profiles in slots configuration 6255111 Binary jobs are problematic for starter and 
epilog scripts 6253860 First character is lost in quoting 6250692 accounting(5) record can't be made available immediately after job finish 6242169 Multi-threaded, multi-CPU username problems 6207868 wording with qconf -cq should be changed 6287953 repeated logging of the error message: "failed building category string for job N" The following Change Request (CR) is related only to Grid Engine running under the Microsoft Windows operating system family 6353638 default process priority on windows freezes the whole system until job is finished 6348478 install script sets rsh_daemon to /usr/sbin/in.rshd on win32-x86 6314019 qloadsensor.exe uses up more and more handles 6279523 qlogin on windows does not work! (from 118089-06) 6299939 distribution should contain all Berkeley DB utilities 6299351 qrsh fails when execd_param INHERIT_ENV=false and no ARC set in sge_execd environment 6299345 No error messages in case SSL initialization failes 6298233 no user notification or command hanging if an immediate job cannot be scheduled 6298056 INHERIT_ENV and SET_LIB_PATH are not reset by setting execd_params to NONE (from 118089-05) 6295165 finished array job tasks can be rescheduled if master/scheduler daemons are stopped/started 6294397 wrong drmaa jnilib link on MacOS 6288588 jobs submitted with -v PATH do not retain $TMPDIR prefixed by N1GE as required for tight integration 6288156 sge_shepherd SEGV's when it tries to fopen the usage file 6287958 suspend not working under Mac OS X 6287867 tight integration: temporary files are not deleted at task exit 6286533 job wallclock monitoring and enforcement considers prolog/epilog runtime part of net job runtime 6285898 qconf -Xattr does not resolve fqdn hostnames 6283308 overhead with job execution could lead to overoptimistic backfilling and break resource reservation 6281462 qmaster profiling can only be turned on by restarting qmaster 6281440 resource allocation shown by qstat/qhost not consistent with resource utilization 
6280698 Resource filtering with qhost broken 6279409 qconf -tsm command generates too much data (very large schedd_runlog file) 6279402 drmaa_exit() causes qmaster error logging if host is no admin host 6278727 qstat -xml -urg output contains badly formatted numbers 6278147 drmaa_job_ps() returns DRMAA_PS_QUEUED_ACTIVE for finished array job rather than DRMAA_PS_DONE 6277909 qconf -mq coredumps 6274467 qmon kills a system 6273217 race condition with qsub -sync and drmaa_wait() if job exits directly after being submitted 6273006 qstat -j "" results in a segmentation fault 6269411 Close integration cause jobscripts with multiple mprun commands to be killed. 6269305 qrsh/qsh/qlogin reject -js option 6268707 job_load_adjustements is not correctly working when parallel jobs are submitted. 6267932 high CPU load of qmaster even on empty cluster 6267245 Repeated logging of the same message produces giant logging files 6267238 Multithreaded DRMAA may crash due to use of sge_strtok() 6266450 performace bottleneck with subordinate list 6266392 Performance problem with qconf -mattr exechost XX XX global 6265154 Wildcards in PE Name Cause Unusual Behavior 6264592 drmaa_control(DRMAA_JOB_IDS_SESSION_ALL, DRMAA_CONTROL_SUSPEND|RESUME) returns INVALID_JOB error 6260656 incomplete resource reservation with array jobs 6252525 qmon: complex attributes not removeable 6252469 missleading qstat -j messages in case of resource reservation 6250603 qmon crash (segmentation fault) on Solaris64 6218877 qstat -t is broken 4769608 qalter shows wrong priority number when using negative priorities with -p option The following Change Request (CR) is related only to Grid Engine running under the Microsoft Windows operating system family 6239470 Avoid that sge_execd has to be started by the Domain Administrator (from 118089-04) 6260729 Can't select 'slots' in select box when adding consumables for execution host 6260024 qmon cluster queue modify cancel not working correct 6259380 potential qmaster 
sec. fault. 6256530 cqueues/all.q trashed after qmaster shutdown with 1362 hosts 6256457 pe jobs disappear in t state (execd doesn't know this job) 6255902 qmake in dynamic allocation mode core dump 6255850 the usage in projects is never spooled while the qmaster 6255804 job in error state breaks qstat -f -xml 6255336 execd does sends empty job report for a pe slave task 6255329 qmaster does not store sharetree usage on shutdown 6253266 failed array tasks are rescheduled only one by one 6253093 qstat -f -pe make breaks 6252524 Missing success message with qconf -Aprj 6252465 qsub option parameter string only supports 2048 character strings 6251943 japi does not work with host aliasing 6251172 reserved jobs prevent other jobs from starting 6247889 qsub -sync y return code behaviour broken 6247239 sequence nr of execd load reports corrupted 6247238 qsub fails to work correctly with -b n -cwd 6247211 qstat -explain E does not print queue errors correctly 6245487 qhost -h does not show selected host 6244865 a series of matching soft queue requests gets not counted separately 6244808 scheduler does not get all objects on a qmaster or scheduler startup 6244229 misleading qstat -j message when the scheduler is not running 6244215 qsub -b y must fail if no command is specified 6242779 qsub -now yes not working on CSP system 6242181 Failed drmaa_control (DRMAA_CONTROL_TERMINATE) causes deadlock 6242172 Multi-threaded args parsing problems 6242165 Profiling library never frees thread slots 6242057 jobs which request consumable resources which are set to infinity are not scheduled 6242055 Consumable request may not be 0 if PE requested 6241544 qstat -F dies in case of a infinit integer setting 6241487 termination script may not be ignored, when job submited with -notify 6241430 error message "no execd known on host" 6241401 Conflicting requirements should have the same meaning with qstat and qsub 6241378 Reservation of wrong hosts 6241376 qstat -U aborts 6240739 qstat -s hu 
shows pending jobs only 6239660 qmaster profiling doesn't start at qmaster startup 6239569 qmaster does not accept new connections if number of execd's exceed FD_SETSIZE 6239394 Spooledit fails during database upgrade 6236475 DRMAA segfaults with > 255 threads 6236472 qsub -sync y doesn't remove session directories 6236469 JAPI: Can be made to start two event client threads 6236261 BDB install on NFSv4 share 6234836 Need a means to purge host or hostgroup specific cluster queue 6234371 error message from execd about endpoint is not unique 6233162 global scheduler messages are reported multiple times 6232074 load formula is not working for pe jobs 6231366 deadlock in the qmaster due to qconf -k[s|e] 6230846 execd logs error mesage, when a tight pe job in "t" state is deleted 6229373 An array pe job can set queues into error state 6229277 qselect uses sge_qstat file 6229253 a parallel array job can kill the qmaster 6228786 Long delay when starting up large pe jobs 6228350 Execd messages file contains incorrectly-formatted lines 6226085 suspend_interval is ignored when enabling jobs due to suspend_thresholds change 6225570 sharetree has a usage leak 6222930 After shadowd takes over there is a long delay before execd connects to new qmaster 6222861 error message "no execd known on host" 6222811 scheduler can get out of sync 6222237 huge CPU and memory overhead when modifiying complex attributes 6221244 releasing user hold state through qrls may not require manager priviledges 6221231 qsub -sync y return code behaviour broken 6221167 sge_schedd segfaults in case of a restart and a running pe job. 
6220060 wrong calendar settings kills the qmaster 6219999 changing of local execd_spool_dir is fault prone 6218430 Problems with load values if execution daemons run in a solaris zone at x86 6215730 qdel failed to delete qrsh (login) job on a Solaris box when Secure Shell is used 6205060 SGE tools segfault when gid can't be looked up 6199256 qconf -[a|A|m|M]stree kills qmaster 6194719 starter_method is ignored with binary jobs that are started without a shell 6186597 qconf error diagnosis broken 6178843 qconf changes to complex doesn't display all the changes made upon exit 5085004 qstat -f -q all.q@HOSTNAME does not resolve hostname (from 118089-03) 6216020 pending job task deletion may not work 6215580 execd messages file contains errors for tight integrated jobs 6211309 qmaster running out of file descriptors 6211243 The qstat -ext -xml command is broken with N1GE6 Update 2 patch 6205648 error in commlib read/write timeout handling (from 118089-02) 6201042 qdel "*" produces error logging in qmaster messages file 6201040 Exit 99 jobs are not rescheduled to hosts where they ran before 6201039 qconf -ks gives bad error message if scheduler isn't running 6201038 reduce the impact of qstat on the overall performance 6201033 qmaster might fail if jobs are deleted which have multiple hold states applied 6200013 arch script does not know about /lib64 6199261 a sharetree delete can kill qmon 6196578 backup failes, when... 6195249 QMON Cluster Queue Window: Heading line words does not match into column width 6194729 Subordinate queue thresholds are not spooled with BDB 6194713 Only first subordinate queue will be suspended at qmaster restart 6194625 subordinate queues consume excessive memory 6194002 sgemaster -migrate on qmaster host tries to start second qmaster 6193866 backup/restore does not work under Linux and others.. 
6193361 Jobs fail in case of NFS execd installation on volumes exported without root write priviledges 6193348 qconf -mq does not output the subordinate_list correct 6191366 tightly integrated pe jobs: scheduler doesn't respect usage of pe tasks in sharetree calculation 6190164 too many array tasks are deleted 6189289 a cluster queue can be deleted, even though it is referenced in an other cq 6189286 memory leak in the scheduler with consumables as load thresholds 6185211 Job environments should not include Grid Engine dynamic library path 6185208 qmon and equal job arguments 6185169 qmon returns an error dialog, when editing a calendar 6185136 Job customize shows weird characters for fields, additional fields cannot be added 6184466 scheduler does not look ahead to consider queue calendars state transitions 6184460 qmod -[d|e] cannot handle the folowing qnames: "[0-9]*" 6183365 qconf -sstree gives a SIGBUS error 6180529 meaningless job error state diagnosis text in qstat -j 6176181 qdel "" kills qmaster 6176177 restoring a backup does not restore the job_scripts dir. 6176115 Show qmaster/execd application status in qping 6174915 qconf has wrong exit status 6174821 segmentation fault when vmemsize limit is reached 6174331 Option "-v VAR" does not fetch from envrionment 6174326 qconf -sq displayes "slots" in the complex_values line 6174301 N1GE6: qsub -js and negative job_share numbers acts strangely/unexpectedly. 5108639 qconf -sstree seg faults with large share trees 5108635 $ARCH required in path for qloadsensor and qidle. 5104789 mail sent by qmaster leaves zombie processes 5104270 Cannot add calendar with \ syntax 5102442 qconf -de crashes qmaster 5102320 memory leak in the scheduler, with pe jobs and resource requests 5097732 Need detailed error messages from communication layer 5095907 qacct -l is not working 5094016 o-tickets assigned to departments are ignored 5092487 hard resource requests ignored in parallel jobs 5090162 qmake does not export shell env. 
vars 5089255 Submit to a queue domain is never scheduled 5089222 scheduling weirdness with wild-card PE's 5086108 wrong message appears when queue instance becomes error state 5085010 qmon customize filter for running jobs does not filter 5075968 Thread enabled commlib coredumps on exit on a 32bit Solaris x86 box (from 118089-01) 5085392 qstat -j -xml generates no parseble xml output 5084317 Invalid job_id's in reporting file (only l24_amd64) 5083115 Need more verbose diagnosis msg if execd port is already bound 5083102 hostgroup changes do not always take effect. 5082490 qstat -ext -urg omits time info 5081839 qconf -ahgrp fails if no hgrp name is specified 5081822 Deleting a queue instance slots value actually adds it 5081821 qstat XML output typo 5080856 QCONF: qconf -mc segfaults 5080853 DRMAA doesn't reject jobs that never will be dispatchable 5080852 qconf -aq @ crashes qmaster 5080851 qalter/qdel/qmod abort 5080840 problems when qconf -mattr is used in conjunction with host_aliases file 5080839 qconf -mq displayes "slots" in the complex_values line 5080836 qhosts outputs NCPU as float 5080833 qconf -mattr dumps core if used incorrectly 5080784 qselect crash 5080779 qconf -de host does not update the host groups 5079572 Resending queue signals broken 5079514 execd shutdown with sgeexecd fails when host aliases are used 5078783 Wallclock time limit in qmon 5077589 schedd and qmaster get out of sync - no scheduling for long time 5077549 qsub -N "@" causes qmaster down 5077165 reprioritize_interval descr in sched_conf(5) needs improvemen 5076491 qmaster clients may not reconnect after qmaster outage 5076372 "|" should be able to be used with qsub -N 5076358 It shuld be used "." 
and "$" with qsub -N 5075936 qmon's queue filtering doesn't work 5075849 a registering event client can get events before it got its total update 5075451 sched_conf(5) reprioritize_interval should default to 0 5075398 variable syntax : equal sign support 5075346 Sharetree doesn't work correct 5074788 jobs on hold due to -a time cause qmaster/schedd get out of sync 5073218 qconf -aq @ crashes qmaster 5072772 sge_qmaster constantly rewrites spool files of tightly integrated parallel jobs 5072481 Deleted pending job appears in qstat 5072005 drmaa_run_job() may change the current directory 5071987 Qmaster requires a local conf in order to start. 5071918 qmod -e '@' causes segmentation fault in qmaster 5071914 scheduler ignores queue seqno for queue sorting 5071539 qping doesn't support host_aliases file 5071525 qalter abort 5071522 Startup of qmaster changes act_qmaster to `hostname` 5071502 calendars broken 5071498 projects not available after sge_qmaster restart 5063987 qmaster cannot bind port below 1024 on Linux 5063316 PE job submit error, when qmaster is busy 5063311 high memory usage of schedd and qmaster (schedd_job_info)

Patch Installation Instructions:
--------------------------------

tar.gz Patch Installation:
--------------------------

See the patch installation instructions below before installing this
patch!

Patches in 'tar.gz' format cannot be installed with 'patchadd' on
Solaris systems.

The patch is installed by unpacking the 'tar.gz' file(s) in this
directory into the Grid Engine installation directory, which is usually
your $SGE_ROOT directory.

Once installed, this patch is not visible to the "showrev -p" command
on Solaris.

This patch cannot be backed out. You may want to make a backup copy of
the files before installing this patch since the files will be
overwritten.

Please read "Install Instructions" later in this file and carry out all
steps before you unpack the 'tar.gz' file(s) included in this patch.
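Since the patch cannot be backed out, the backup copy suggested above could be taken with a small shell helper along the following lines. This is only a sketch: the function names, the argument order and the archive path are assumptions made for illustration and are not part of the N1 Grid Engine distribution. The directory list is derived from the "Files included with this patch" section.

```shell
#!/bin/sh
# Sketch only: function names and arguments are hypothetical.

# Print the architecture-specific directories (relative to the
# installation directory) that the file list of this patch touches.
patch_dirs() {
    arch=$1
    echo "bin/$arch lib/$arch utilbin/$arch examples/jobsbin/$arch"
}

# Archive those directories so the old files can be restored by hand.
#   $1 = installation directory
#   $2 = architecture string (darwin for this patch)
#   $3 = absolute path of the tar file to create
backup_patch_dirs() {
    root=$1
    arch=$2
    archive=$3
    # Word splitting of the directory list is intentional here.
    (cd "$root" && tar cf "$archive" $(patch_dirs "$arch"))
}
```

A call might look like `backup_patch_dirs "$SGE_ROOT" darwin /var/tmp/sge-before-118089-07.tar`, where the archive path is only an example.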
This patch in 'tar.gz' format should not be installed if the original
package has been installed with 'pkgadd' on Solaris. If the original
installation used the package ('pkgadd') utility, install the available
patches for N1 Grid Engine 6; refer to the patch matrix below.

The patch is installed by user root by unpacking the file(s) in the
directory where the original package has been installed:

   # cd <sge_root>
   # gzip -dc <directory>/<file> | tar xvpf -

After installing the patch, you should correct the file permissions if
your Sun Grid Engine installation is installed as an "admin user"
system:

   # cd <sge_root>
   # util/setfileperm.sh

Patch requirements and patch matrix for N1 Grid Engine 6 packages
-----------------------------------------------------------------

The patches below update an N1 Grid Engine 6 distribution to N1 Grid
Engine 6 Update 7 (N1GE 6.0u7). The "-help" output of most commands will
print a version string "N1GE 6.0u7" after applying the patch.

All packages of an N1 Grid Engine 6 distribution must have the same
patch level (exception for ARCo - see requirements for ARCo in the ARCo
patches README's). Please refer to the patch matrix below, which updates
the distribution to the most recent patch level. It is neither supported
nor possible to mix different patch levels of binaries and the "common"
package in a single N1 Grid Engine cluster.

1. Patches for packages in Sun pkgadd format
--------------------------------------------

   Package name*  OS*                     Architecture*  Patch-Id
   -----------------------------------------------------------------
   SUNWsgee       Solaris, Sparc, 32bit   sol-sparc      118094-07
   SUNWsgeex      Solaris, Sparc, 64bit   sol-sparc64    118130-07
   SUNWsgeex      Solaris x86             sol-x86        118131-07
   SUNWsgeeax     Solaris, x64 (AMD64)    sol-amd64      120438-03
   SUNWsgeec      all                     common         118132-07
   SUNWsgeea      all                     arco           118133-05
   SUNWsgeed      all                     doc            119846-02

   *Package Name  = see pkginfo(1)
   *OS            = Operating system
   *Architecture  = N1 Grid Engine binary architecture string or
                    "common" = architecture independent packages
                    "arco"   = Accounting and Reporting console
                    "doc"    = PDF documentation
                    "gemm"   = Grid Engine Management Module for Sun
                               Control Station (SCS) (tar.gz only)

2. Patches for packages in tar.gz format
----------------------------------------

   OS*                          Architecture  Patch-Id
   -----------------------------------------------------
   Solaris, Sparc, 32bit        sol-sparc     118082-07
   Solaris, Sparc, 64bit        sol-sparc64   118083-07
   Solaris, x86                 sol-x86       118084-07
   Solaris, x64 (AMD64)         sol-amd64     120439-03
   Linux kernel2.4/2.6, x86     lx24-x86      118085-07
   Linux kernel2.4/2.6, AMD64   lx24-amd64    118086-07
   IBM AIX 4.3                  aix43         118087-07
   IBM AIX 5.1                  aix51         118088-07
   Apple MAC OS/X               darwin        118089-07
   HP HP-UX 11                  hp11          118090-07
   SGI Irix 6.5                 irix65        118091-07
   Microsoft Windows            win32-x86     120434-03
   all                          common        118092-07
   all                          arco          118093-05
   all                          doc           119861-02
   Solaris, Linux               gemm          120435-02

Special Install Instructions:
-----------------------------

Content
-------

Patch Installation
   Stopping the N1 Grid Engine cluster from starting jobs
   Shutting down the N1 Grid Engine daemons
   Installing the patch and restarting the software
New functionality delivered since N1GE 6.0 Update 4
   Reworked "qstat -xml" output
   Reworked PE range matching algorithm in the scheduler
   New monitoring feature in qmaster
   New parameter for specialized job deletion
   New reporting parameter to control accounting file flush time
New functionality delivered since N1GE 6.0
   Avoid setting of LD_LIBRARY_PATH
   DRMAA Java[TM] language binding delivered with this patch
   New qstat options to optimize memory overhead and speed of qstat
   Tuning parameter for sharetree spooling
   New "qconf -purge" option
   Berkeley DB spooling on NFSv4 under Solaris 10 supported
   Execd installation in Solaris 10 zones supported
   Faster execution daemon reconnect in CSP mode
   Berkeley DB database tools are included in the distribution

Patch Installation
------------------

NOTE: This patch requires that you update your Berkeley DB database
files if you are upgrading from N1GE 6.0u1 or 6.0. Please read the full
notes when applying this patch.

These installation instructions assume that you are running a
homogeneous N1 Grid Engine cluster (called "the software") where all
hosts share the same directory for the binaries. If you are running the
software in a heterogeneous environment (mix of different binary
architectures), you need to apply the patch installation for all binary
architectures as well as the "common" and "arco" packages. See the patch
matrix above for details about the available patches.

If you installed the software on local filesystems, you need to install
all relevant patches on all hosts where you installed the software
locally.

By default, there should be no running jobs when the patch is installed.
There may be pending batch jobs, but no pending interactive jobs (qrsh,
qmake, qsh, qtcsh).

It is possible to install the patch with running batch jobs. To avoid a
failure of the active 'sge_shepherd' binary, it is necessary to move the
old shepherd binary (and copy it back prior to the installation of the
patch).

You cannot install the patch with running interactive jobs, 'qmake' jobs
or with running parallel jobs which use the tight integration support
(control_slaves=true in PE configuration is set).

A. Stopping the N1 Grid Engine cluster from starting jobs
---------------------------------------------------------

Disable all queues so that no new jobs are started:

   # qmod -d '*'

Optional (only needed if there are running jobs which should continue to
run when the patch is installed):

   # cd $SGE_ROOT/bin
   # mv <arch>/sge_shepherd <arch>/sge_shepherd.sge60

It is important that the binary is moved with the "mv" command. It
should not be copied, because this could crash an active shepherd
process which is currently running a job when the patch is installed.

B. Shutting down the N1 Grid Engine daemons
-------------------------------------------

You need to shut down (and restart) the qmaster and scheduler daemon and
all running execution daemons.

Shut down all your execution hosts. Log in to all your execution hosts
and stop the execution daemons:

   # /etc/init.d/sgeexecd softstop

Then log in to your qmaster machine and stop qmaster and scheduler:

   # /etc/init.d/sgemaster stop

Now verify with the 'ps' command that all N1 Grid Engine daemons on all
hosts are stopped. If you decided to rename the 'sge_shepherd' binary so
that running jobs can continue to run during the patch installation, you
must not kill the 'sge_shepherd' binary (process).

C. Installing the patch and restarting the software
---------------------------------------------------

Now install the patch, either with "patchadd" or by unpacking the
'tar.gz' files included in this patch, as outlined above.

Berkeley DB database update needed
----------------------------------

NOTE: This update is *not* needed if you already installed N1GE 6.0u3 or
higher. The update is only needed if you are upgrading from N1GE 6.0u2
or earlier.
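As a rough illustration of the version rule in the note above, the check could be sketched in shell as follows. The function name needs_bdb_update and the bare version strings ("6.0", "6.0u1", ...) are assumptions made for this sketch and are not part of the N1 Grid Engine distribution. The sketch covers only the version rule of the note; the update is in any case relevant only for installations that use BDB spooling.

```shell
#!/bin/sh
# Sketch only: needs_bdb_update is a hypothetical helper.
# Prints "yes" if the note above calls for a BDB update when upgrading
# from the given N1GE 6.0 version, "no" if not, "unknown" otherwise.
needs_bdb_update() {
    case "$1" in
        6.0)   level=0 ;;              # base release counts as update 0
        6.0u*) level=${1#6.0u} ;;      # strip the "6.0u" prefix
        *)     echo "unknown"; return 1 ;;
    esac
    case "$level" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;   # not a plain number
    esac
    # The note names 6.0u2 or earlier as the versions needing the update.
    if [ "$level" -le 2 ]; then
        echo "yes"
    else
        echo "no"
    fi
}
```

For example, `needs_bdb_update 6.0u1` prints "yes" and `needs_bdb_update 6.0u3` prints "no".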
After installing this patch, and before restarting your cluster you need to update your Berkeley DB (BDB) database in the following cases: - you choose the BDB spooling option (not needed for classic spooling) either locally or with the BDB RPC option, and you are upgrading your cluster for N1 Grid Engine 6.0 or 6.0u1 to N1 Grid Engine 6.0u2 or higher 1. For safety reasons, please make a full backup of your existing configuration. To perform a backup use this command % inst_sge -bup 2. Upgrade your BDB database. This is done as follows: % inst_sge -updatedb Restarting the software ----------------------- Please login to your qmaster machine and execution hosts and enter: # /etc/init.d/sgemaster # /etc/init.d/sgeexecd After restarting the software, you may again enable your queues: # qmod -e '*' If you renamed the shepherd binary, you may safely delete the old binary when all jobs which where running prior the patch installation have finished. New functionality delivered since N1GE 6.0 Update 4 --------------------------------------------------- 1. Reworked "qstat -xml" output ------------------------------- The schema for "qstat -xml" and the "qstat -xml" output have been reworked to ensure consistency between them and easy parsing of them via JAXB. The most noticeable change will the date output. It follows now the XML datetime format. 2. Reworked PE range matching algorithm in the scheduler -------------------------------------------------------- The PE range matching algorithm is now adaptable and learns from the past decisions. This will lead to a much faster scheduling decision in case of pe-ranges. This can be controlled by a new scheduling configuration parameter: SELECT_PE_RANGE_ALG. It allows to restore the old behavior. See sge_conf(5) for more information. 3. New monitoring feature in qmaster ------------------------------------ The monitoring allows to get detailed statistics what the qmaster threads are doing and how busy they are. 
The statistics can be accessed via "qping -f" or from the qmaster
messages file. The feature is controlled by two qmaster configuration
parameters:

   MONITOR_TIME          specifies the time interval for the statistics
   LOG_MONITOR_MESSAGE   enables/disables the logging of the monitoring
                         messages into the qmaster messages file

See sge_conf(5) for more information.

4. New parameter for specialized job deletion
---------------------------------------------

A new "execd_param" (configured in the global cluster configuration):

   ENABLE_ADDGRP_KILL=true

can be configured to enable additional code within the execution host
to delete jobs. If this parameter is set, the supplementary group IDs
are used to identify all processes which are to be terminated when a
job is deleted. It only takes effect on the following architectures:

   sol* lx* osf4 tru64

See sge_conf(5) under "gid_range" for more information.

5. New reporting parameter to control accounting file flush time
----------------------------------------------------------------

A new reporting parameter, "accounting_flush_time", controls the flush
period for the accounting file. Previously, both the accounting and
reporting files were flushed at the same interval. Now they can be set
independently. Additionally, buffering of the accounting file can now
be disabled, allowing accounting data to be written to the accounting
file as soon as it becomes available. See sge_conf(5) for more
information.

New functionality delivered since N1GE 6.0
------------------------------------------

1. Avoid setting of LD_LIBRARY_PATH; inherited job environment
--------------------------------------------------------------

There are two new "execd_params" (defined in the global or local
cluster configuration) which control the environment inherited by a
job:

   SET_LIB_PATH
   INHERIT_ENV

By default, SET_LIB_PATH is false and INHERIT_ENV is true.
If SET_LIB_PATH is true and INHERIT_ENV is true, each job will inherit
the environment of the shell that started the execd, with the N1GE lib
directory prepended to the lib path.

If SET_LIB_PATH is true and INHERIT_ENV is false, the environment of
the shell that started the execd will not be inherited by jobs, and the
lib path will contain only the N1GE lib directory.

If SET_LIB_PATH is false and INHERIT_ENV is true, each job will inherit
the environment of the shell that started the execd with no additional
changes to the lib path.

If SET_LIB_PATH is false and INHERIT_ENV is false, the environment of
the shell that started the execd will not be inherited by jobs, and the
lib path will be empty.

Environment variables which are normally overwritten by the shepherd,
such as PATH or LOGNAME, are unaffected by these new parameters.

2. DRMAA Java[TM] language binding delivered with this patch
------------------------------------------------------------

The DRMAA Java language binding is now available. The DRMAA Java
language binding library and documentation are contained in the patch
for the "common" package.

3. New qstat options to optimize memory overhead and speed of qstat
-------------------------------------------------------------------

The qstat client command has been enhanced to reduce the overall amount
of memory which is requested from the qmaster. To enable these changes,
it is necessary to change the qstat default behavior. This is done by
defining a cluster-global or user-specific sge_qstat file. More
information can be found in the sge_qstat(5) manual page. In addition,
two new qstat options ("-u" and "-s") have been introduced for use with
the sge_qstat default file. See qstat(1) for more information.

4. Tuning parameter for sharetree spooling
------------------------------------------

A new "qmaster_param" (configured in the global cluster configuration):

   STREE_SPOOL_INTERVAL=