Patch-ID# 106454-01
Keywords: solstice backup networker
Synopsis: Solstice Backup 5.1_x86: Product Patch
Date: Jul/09/99

Solaris Release: 2.5_x86 2.5.1_x86 2.6_x86

SunOS Release: 5.5_x86 5.5.1_x86 5.6_x86

Unbundled Product: Solstice Backup

Unbundled Release: 5.1_x86

Relevant Architectures: i386

BugId's fixed with this patch: 4086246 4117627 4133424 4137841

Changes incorporated in this version:

Patches accumulated and obsoleted by this patch:

Patches which conflict with this patch:

Patches required with this patch:

Obsoleted by:

Files included with this patch:

/usr/sbin/nsr/ansrd
/usr/sbin/nsr/jb_config
/usr/sbin/nsr/jbexercise
/usr/sbin/nsr/mminfo
/usr/sbin/nsr/nsralist
/usr/sbin/nsr/nsrarchive
/usr/sbin/nsr/nsrcap
/usr/sbin/nsr/nsrclone
/usr/sbin/nsr/nsrd
/usr/sbin/nsr/nsrexec
/usr/sbin/nsr/nsrexecd
/usr/sbin/nsr/nsrindexd
/usr/sbin/nsr/nsrjb
/usr/sbin/nsr/nsrlmc
/usr/sbin/nsr/nsrlic
/usr/sbin/nsr/nsrmmd
/usr/sbin/nsr/nsrmon
/usr/sbin/nsr/nsrstage
/usr/sbin/nsr/nsrtrap
/usr/bin/nsr/nwbackup
/usr/bin/nsr/nwrecover
/usr/bin/nsr/pstclntsave
/usr/bin/nsr/recover
/usr/bin/nsr/save
/usr/sbin/nsr/savefs
/usr/sbin/nsr/savegrp
/usr/sbin/nsr/scanner
/usr/sbin/nsr/tapeexercise
/usr/bin/nsr/nwretrieve

Problem Description:

o nsrmmd repeats the message "Diagnostic: recvfd(4) fails, errno=6" and consumes all available CPU cycles. The initial diagnosis was that the NetWorker server was also a storage node and that there was contention between nsrmmd processes managed by the server and nsrmmd processes managed by the storage node. However, the errors persisted even when the system was only a server and no longer a storage node. The errors occurred because the set_mm_control function was called only for nsrmmd processes that are not local to the server. Set_mm_control is now called for nsrmmd processes controlled by both the server and the storage node.

o Issuing commands to a jukebox attached to a storage node fails.
If multiple nsrmon requests are pending for a single nsrmmd and the first nsrmon request does not terminate properly because of a timeout, all subsequent requests to that nsrmmd fail. The following error message appears in daemon.log:

    nsrmon : auth failed: invalid storage node proposed:

NOTE: This error message can still occur; it is typically caused by network-related problems.

o Recover does not use the "Server network interface" field defined in the client's resource record. As a result, recovered data may not traverse the desired physical network between the client and the server. Recover now uses the "Server network interface" field.

o The correct browse time is not displayed in the nwrecover window when a valid browse time is entered in the 'Change Browse Time' window.

o CPU utilization rises by 7% for each running nwadmin. When nwadmin refreshes, it polls nsrd, which queries for storage nodes and high-speed devices. This query can be CPU intensive when there are a large number of storage nodes and high-speed devices. This information is now cached.

4086246 Using nwbackup, a directory is marked for backup. If the directory is then collapsed and expanded, it is no longer marked for backup. The directory is not displayed in, or recoverable from, nwrecover.

o Self-id licensing checks are calculated every time they are needed. These checks are costly for the server and can degrade performance on servers with a large number of clients. Client license information is now stored.

o During large savegrps, the client must re-authenticate with the server. On extremely active networks, the traffic can cause re-authentication to fail and the savegrp to abort. The re-authentication code has been changed to retry when a failure is received.

o Storage node nsrmmd processes may be restarted unnecessarily. When polling events for multiple storage nodes occur at the same time, the nsrmon processes update the 'nsrmon info' attribute.
These updates can happen simultaneously, causing nsrd to receive an old, incorrect value for the attribute. This results in sequence number mismatch errors being returned to the wrong caller and in nsrmmd processes being restarted.

o Nsrindexd used 100% of the CPU when a savegroup was started. The savegroup had over 150 clients and the nsr.res file was large. When the savegroup was started, the nsr.res file was opened for every client in the group, so nsrindexd spent most of its time parsing and allocating the in-core version of the database. Nsrindexd was changed to open the index and keep it open.

o Nsrd dumps core during startup if the server has more than 100 clients configured. The memory re-allocation routine was changed so that the check to expand the client name list happens at the 100th client instead of the 101st.

o When save is executed with the -x option to cross mount points explicitly, recover and nwrecover do not list the directory entry for the mounted filesystem in the browse list. In recover, the entry does not appear when files are listed with ls; the directory can nevertheless be added to the recover list and restored, after which the entry appears in the browse list during the next listing of files. Nwrecover does not display the entry either, but it appears if the absolute path is specified in the selection field. If a command line recover is performed, the directory can be added to the recover list and successfully restored.

o Creating index entries with 'scanner -i' fails.
Rebuilding index entries using 'scanner -i' with a tape that holds valid data fails with the following set of errors:

    write failed, Broken pipe
    fn # rn # read error Bad file number
    ssid #: NOT complete
    ssid #: # MB, # file(s) done with dlt tape volume.name
    error, rewind Bad file number

An internal file descriptor becomes corrupted when scanner spans tapes. Once the file descriptor entry is corrupted, the child scanner is unable to read data from the pipe. The child then hangs while the parent tries to write data to the pipe. The write fails because the file descriptor no longer points to the child. The parent then tries to shut down the saveset processing and closes the pipe, which causes the "broken pipe" error. Scanner then tries to rewind the tape, but cannot because of the bad file descriptor.

'Scanner -i' also fails when rebuilding the indexes of large savesets. File descriptors were not closed, causing files to remain open. With large savesets, many files remained open, preventing child processes from exiting. This has been one of the causes of the error "Pipe: too many open files" on HP-UX systems.

o A trap received from NetWorker has incorrect values for IPADDRESS and TIMETICKS.

o If an I/O error occurs while NetWorker is reading a tape, NetWorker may loop, continuously trying to read the same tape, or may try to read the tape and eventually disable the drive. Even though NetWorker detected an error with the tape, the volume was never marked 'full'. If a drive is marked read-only and a backup is started, NetWorker loops, loading and unloading the volume in that drive.

o Additional checking has been added to nsrd. NetWorker now verifies the case of pool names: if a pool name is typed with the wrong case, an error message is returned. For example:

    # save -s alain -b "DEFAULT" /tmp
    save: RAP error, There is no pool named `DEFAULT'.
    save: Cannot open save session with alain

The pool name should have been "default", in all lowercase letters. Previously, NetWorker would wait for a tape in pool DEFAULT to become available:

    08/06/98 10:28:16 nsrd: media critical event: backup to pool 'DEFAULT' waiting for 1 writable backup tape(s)

o Nsrstage aborted the staging operation if the ssid/cloneid of an incomplete save set was specified. Nsrstage now skips the specified incomplete save set instead of aborting.

o The ssi module fails on Solaris 2.6 with an STK silo. Starting the ssi module (/usr/sbin/nsr/ssi) fails. Ssi_event.log contains the errors:

    03-20-98 15:58:14 SSI[O] ONC RPC:csi_init(): Initiation Started
    03-20-98 15:58:14 SSI[0] Found transient program number in rpctinit : 40000000
    03-20-98 15:58:14 SSI[0] Found transient program number in rpctinit : 40000000NC RPC: csi_rpctinit(): status: STATUS_NI_FAILURE; failed: getsockname()
    03-20-98 15:58:14 SSI[O]: RPC: csi_main(): status:STATUS_PROCESS_FAILURE; failed: main() Initiation of CSI Failed;

o Nsrjb is unable to import a tape via the import/export door on ATL jukeboxes. Nsrjb -d fails with "source component empty" or "invalid request, invalid element number", and the green light to load a tape never comes on.

o When a device is disabled because its error count has been reached, nwadmin does not display the drive as disabled.

o Restoring data fails on HP10x systems that performed backups with immediate save enabled. Data backed up with immediate save enabled on an HP10x system is compromised; full backups should be performed to ensure valid backups.

o The post-processing commands executed by savepnpc do not always execute when a time is specified with 'timeout'. The failure has been seen on systems that are heavily loaded when post-processing should begin. Pstclntsave has been modified to handle time comparisons on loaded systems.

o Some systems that back up extremely compressed data have experienced gaps in data transfer. No data is lost, but performance degrades. The gaps are caused by writing file marks to the tape synchronously: NetWorker blocks while the buffers are flushed and the file mark is written, causing data transfer gaps on some systems. File marks between data are now written asynchronously. The end-of-data file mark is still written synchronously, to ensure that the jukebox does not try to eject the tape before all the buffers have been flushed.

o Scheduled cloning selects save sets that have already been cloned if the save sets have a date in the future.

o When save processes a named pipe, it incorrectly identifies it as a mount point because the device number of the named pipe is different from that of the filesystem containing the named pipe. This often causes recover browse time problems and recover failures. Sometimes the file can be recovered using command line recover and specifying the time stamp, or using interactive recover and explicitly adding the named pipe to the recover list. A new save binary has been produced to address this issue.

o Changing the mode of a volume to recyclable does not mark the volume full or read-only. If the volume whose mode was changed to recyclable was mounted and a backup started, the data from the backup would have been appended to the tape.

4133424 Running 'mminfo -r volid' on a backup that was performed within the last 24 hours causes mminfo to dump core.

o Pstclntsave cannot determine when a savegrp has been aborted.

o Savegrp is unable to back up more than 27 save sessions in parallel and receives the error:

    savegrp: Couldn't create log file /nsr/tmp/sg.grp_name.srvr_name.(xxxxx for client backup

4117627 Jb_config will only accept the default device name. Entering a device name other than the default results in an error and causes jb_config to loop, prompting for another device name.

o jb_config displays the error:

    Pathname 'entered_dev_name' is not valid. Please try again.
    Enter pathname of media drive

4137841 Configuring a SILO with jb_config fails with:

    jb_config: error, Unknown error
    There appears to be something incorrect about your jukebox installation.
    Please consult your device's documentation to help you troubleshoot your installation.

o Relabeling a jukebox tape fails with the error:

    nsrd: media alert event: Jukebox 'name' failed update handler got seq. number 'xxx', should be 'yyy'

o Some old-style SBU Network Edition enablers are recognized as an SBU Single Server base enabler. If the system has been configured with more than one client and more than one device, nwadmin reports the error "Server is disabled Too many devices". When a system has been changed to a Single Server, the server is reset to default resources, new resources cannot be created, custom resources are removed, and the server may be disabled.

o Nsrstage or savegrp fails on a file type device when the seek position reaches 2.3GB - 3GB. Errors reported by nsrstage or savegrp:

    nsrd: media info: cannot seek to record 'xxx' file 'yyy' on /file_type_device: Invalid argument

or

    nsrd: media: info: cannot seek on /file_type: Invalid argument

o Client licensing code has been optimized.

Patch Installation Instructions:
--------------------------------

Refer to the Install.info file for instructions on using the generic 'installpatch' and 'backoutpatch' scripts provided with each patch. Any other special or non-generic installation instructions should be described below as special instructions.

Special Install Instructions:
-----------------------------

None.
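
The named-pipe item in the Problem Description above hinges on how a mount point is recognized: comparing device numbers is only meaningful for directories. The following Python sketch illustrates that idea, assuming mount points are detected by comparing a directory's st_dev with its parent's. It is not the code in the new save binary, and `is_mount_point` is a hypothetical name.

```python
import os
import stat
import tempfile


def is_mount_point(path: str) -> bool:
    """Hypothetical check: is `path` a directory on a different device than its parent?

    Only directories can be mount points, so named pipes and other
    non-directories are rejected before device numbers are compared. That
    guard keeps a FIFO whose st_dev differs from its filesystem's from being
    mistaken for a mount point.
    """
    st = os.lstat(path)  # lstat() does not open the FIFO, so it never blocks
    if not stat.S_ISDIR(st.st_mode):
        return False
    parent = os.path.dirname(os.path.abspath(path))
    # For "/" the parent is "/" itself, so this sketch reports False for the root.
    return st.st_dev != os.lstat(parent).st_dev


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fifo = os.path.join(tmp, "pipe")
        os.mkfifo(fifo)
        print(fifo, "is a mount point:", is_mount_point(fifo))
```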
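
The nsrd startup item above describes an off-by-one in when the client name list is expanded. A minimal Python sketch of the corrected check follows. The initial capacity of 100, the doubling growth policy, and the `ClientNameList` class are all assumptions made for illustration; the README does not describe nsrd's data structure.

```python
class ClientNameList:
    """Hypothetical growable client name list backed by a fixed-size array."""

    def __init__(self, capacity: int = 100) -> None:
        # The initial capacity of 100 is an assumption for illustration.
        self._slots = [None] * max(1, capacity)
        self._count = 0

    def add(self, name: str) -> None:
        # Expand as soon as every slot is in use (count == capacity), i.e.
        # before storing the entry that would land one past the end.
        # Testing `count > capacity` instead would fire one client too late.
        if self._count >= len(self._slots):
            self._slots.extend([None] * len(self._slots))  # double the capacity
        self._slots[self._count] = name
        self._count += 1

    def capacity(self) -> int:
        return len(self._slots)

    def names(self) -> list:
        return self._slots[: self._count]


clients = ClientNameList()
for i in range(150):
    clients.add(f"client{i:03d}")
print(len(clients.names()), clients.capacity())  # prints: 150 200
```

With the check written as `count > capacity`, adding the 101st name would index one slot past the end of the array. In this sketch that raises IndexError; in C it would be a silent heap overrun of the kind that can later surface as a core dump.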
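
The large-savegrp item above says re-authentication now retries when a failure is received, but does not say how many times or how long it waits. The Python sketch below assumes a small fixed number of retries with exponential backoff; `with_retries` and `AuthError` are hypothetical names, and the injectable `sleep` exists only so the behavior can be checked without real delays.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class AuthError(Exception):
    """Hypothetical transient re-authentication failure."""


def with_retries(
    attempt: Callable[[], T],
    retries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `attempt`, retrying up to `retries` more times after an AuthError.

    The wait doubles after each failure (delay, 2*delay, 4*delay, ...) to give
    a congested network time to recover. If every attempt fails, the last
    AuthError propagates to the caller, which can then abort the savegrp.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for n in range(retries + 1):
        try:
            return attempt()
        except AuthError:
            if n == retries:
                raise  # out of retries: give up
            sleep(delay * 2**n)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Retrying only on the transient error type, rather than on any exception, keeps genuine failures such as a rejected credential from being masked by repeated attempts.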
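
The nwadmin CPU item above says the storage node and high-speed device information that nsrd gathers is now cached, without saying when the cache is refreshed. The Python sketch assumes a fixed time-to-live; `TimedCache` and its injectable clock are hypothetical.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TimedCache(Generic[T]):
    """Hypothetical cache that keeps one expensive query result for `ttl` seconds."""

    def __init__(self, query: Callable[[], T], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._query = query  # e.g. the storage node / high-speed device scan
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        # Re-run the costly query only on first use or once the TTL has expired.
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._query()
            self._fetched_at = now
        return self._value  # type: ignore[return-value]
```

Each nwadmin refresh can call `get()`; the expensive scan then runs at most once per TTL, however many nwadmin sessions are polling. The trade-off is that a newly added storage node may not appear until the TTL expires.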
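
The file type device item above reports seek failures with "Invalid argument" somewhere between 2.3GB and 3GB, but does not state the cause. One plausible cause, offered here only as an assumption, is a byte offset held in a signed 32-bit integer: anything past 2^31 - 1 bytes (about 2.1 GB) wraps to a negative number, and lseek() rejects a negative resulting offset with EINVAL. The Python sketch uses ctypes to show the wrap-around; `offset_as_int32` is a hypothetical helper.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest offset a signed 32-bit variable can hold (~2.1 GB)


def offset_as_int32(offset_bytes: int) -> int:
    """Return the value a signed 32-bit variable holds after storing `offset_bytes`.

    ctypes performs no overflow checking, so values above INT32_MAX wrap
    around to negative numbers, as a 32-bit signed offset typically does in C.
    """
    return ctypes.c_int32(offset_bytes).value


for offset in (2_000_000_000, 2_300_000_000, 3_000_000_000):
    stored = offset_as_int32(offset)
    status = "ok" if stored == offset else "wrapped; a seek to it would fail with EINVAL"
    print(f"{offset:>13,} bytes -> {stored:>14,} ({status})")
```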
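
The scanner item above traces hung child processes and the HP-UX "Pipe: too many open files" error to file descriptors that were never closed. The Python sketch below shows the general POSIX rule for a parent and child sharing a pipe: each process closes the end it does not use, and the producer closes the write end when it is finished so the consumer sees EOF. It is POSIX-only, and `feed_child` is a hypothetical helper, not scanner's code.

```python
import os
from typing import Sequence


def feed_child(chunks: Sequence[bytes]) -> int:
    """Stream `chunks` to a forked child over a pipe; return the child's exit status.

    If the child kept its inherited copy of the write end open, its read()
    would never see EOF, so it would never exit and waitpid() would hang.
    """
    expected = sum(len(c) for c in chunks)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the consumer
        status = 1
        try:
            os.close(write_fd)  # drop the inherited write end, or EOF never arrives
            received = 0
            while True:
                data = os.read(read_fd, 65536)
                if not data:  # EOF: every copy of the write end is closed
                    break
                received += len(data)
            os.close(read_fd)
            status = 0 if received == expected else 1
        finally:
            os._exit(status)  # never return into the parent's code path
    # parent: the producer
    os.close(read_fd)  # the parent only writes
    for chunk in chunks:
        view = memoryview(chunk)
        while view:  # os.write() may accept fewer bytes than requested
            view = view[os.write(write_fd, view):]
    os.close(write_fd)  # signals EOF to the child
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```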