Wednesday, December 9, 2009

Solaris Notes

Solaris SPARC Boot Sequence

The following represents a summary of the boot process for a Solaris 2.x system on SPARC hardware.
• Power On: Depending on the system involved, you may see some output on a serial terminal immediately after power on. This may take the form of a Hardware Power ON message on a large Enterprise server, or a "'" or "," in the case of an older Ultra system. These indications will not be present on a monitor connected directly to the server.

• POST: If the PROM diag-switch? parameter is set to true, output from the POST (Power On Self Test) will be viewable on a serial terminal. The PROM diag-level parameter determines the extent of the POST tests. (See the Hardware Diagnostics page for more information on these settings.) If a serial terminal is not connected, a prtdiag -v will show the results of the POST once the system has booted. If a keyboard is connected, it will beep and the keyboard lights will flash during POST. If the POST fails, an error indication may be displayed following the failure.

• Init System: The "Init System" process can be broken down into several discrete parts:
o OBP: If diag-switch? is set, an Entering OBP message will be seen on a serial terminal. The MMU (memory management unit) is enabled.
o NVRAM: If use-nvramrc? is set to true, read the NVRAMRC. This may contain information about boot devices, especially where the boot disk has been encapsulated with VxVM or DiskSuite.
o Probe All: This includes checking for SCSI or other disk drives and devices.
o Install Console: At this point, a directly connected monitor and keyboard will become active, or the serial port will become the system console. If a keyboard is connected to the system, the lights will flash again during this step.
o Banner: The PROM banner will be displayed. This banner includes a logo, system type, PROM revision level, the ethernet address, and the hostid.
o Create Device Tree: The hardware device tree will be built. This device tree can be explored using PROM monitor commands at the ok> prompt, or by using prtconf once the system has been booted.


• Extended Diagnostics: If diag-switch? and diag-level are set, additional diagnostics will appear on the system console.
• auto-boot?: If the auto-boot? PROM parameter is set, the boot process will begin. Otherwise, the system will drop to the ok> PROM monitor prompt, or (if sunmon-compat? and security-mode are set) the > security prompt.
The boot process will use the boot-device and boot-file PROM parameters unless diag-switch? is set. In this case, the boot process will use the diag-device and diag-file.
• bootblk: The OBP (Open Boot PROM) program loads the bootblk primary boot program from the boot-device (or diag-device, if diag-switch? is set). If the bootblk is not present or needs to be regenerated, it can be installed by running the installboot command after booting from a CDROM or the network. A copy of the bootblk is available at /usr/platform/`arch -k`/lib/fs/ufs/bootblk
• ufsboot: The secondary boot program, /platform/`arch -k`/ufsboot is run. This program loads the kernel core image files. If this file is corrupted or missing, a bootblk: can't find the boot program or similar error message will be returned.
• kernel: The kernel is loaded and run. For 32-bit Solaris systems, the relevant files are:
1. /platform/`arch -k`/kernel/unix
2. /kernel/genunix
For 64-bit Solaris systems, the files are:
3. /platform/`arch -k`/kernel/sparcV9/unix
4. /kernel/genunix
As part of the kernel loading process, the kernel banner is displayed to the screen. This includes the kernel version number (including patch level, if appropriate) and the copyright notice.
The kernel initializes itself and begins loading modules, reading the files with the ufsboot program until it has loaded enough modules to mount the root filesystem itself. At that point, ufsboot is unmapped and the kernel uses its own drivers. If the system complains about not being able to write to the root filesystem, it is stuck in this part of the boot process.
The boot -a command single-steps through this portion of the boot process. This can be a useful diagnostic procedure if the kernel is not loading properly.
• /etc/system: The /etc/system file is read by the kernel, and the system parameters are set.
The following types of customization are available in the /etc/system file:
o moddir: Changes path of kernel modules.
o forceload: Forces loading of a kernel module.
o exclude: Excludes a particular kernel module.
o rootfs: Specify the filesystem type for the root file system. (ufs is the default.)
o rootdev: Specify the physical device path for root.
o set: Set the value of a tuneable system parameter.
If the /etc/system file is edited, it is strongly recommended that a copy of the working file be made to a well-known location. In the event that the new /etc/system file renders the system unbootable, it might be possible to bring the system up with a boot -a command that specifies the old file. If this has not been done, the system may need to be booted from CD or network so that the file can be mounted and edited.
• kernel initialized: The kernel creates PID 0 (sched). The sched process is sometimes called the "swapper."
• init: The kernel starts PID 1 (init).
• init: The init process reads the /etc/inittab and /etc/default/init and follows the instructions in those files.
Some of the entries in the /etc/inittab are:
o fs: sysinit (usually /etc/rcS)
o is: default init level (usually 3, sometimes 2)
o s#: script associated with a run level (usually /sbin/rc#)
• rc scripts: The rc scripts execute the files in the /etc/rc#.d directories. They are run by the /sbin/rc# scripts, each of which corresponds to a run level. Debugging can often be done on these scripts by adding echo lines to print either an "I got this far" message or the value of a problematic variable, as in the sketch below.
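A minimal sketch of that technique (the script name, message, and variable are hypothetical):

# temporarily added near the top of a copy of /etc/rc2.d/S99myapp (hypothetical script)
echo "S99myapp: reached the network check" > /dev/console
echo "S99myapp: CONFIGFILE=$CONFIGFILE" > /dev/console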
Solaris Kernel Tuning
sysdef -i reports on several system resource limits. Other parameters can be checked on a running system using adb -k :
adb -k /dev/ksyms /dev/mem
parameter-name/D
^D (to exit)
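For example, to display the current value of maxusers on a live system (a read-only operation; adb echoes the symbol name followed by its decimal value):

# adb -k /dev/ksyms /dev/mem
maxusers/D
^D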
More information on kernel tuning is available in Sun's online documentation.

maxusers
The maxusers kernel parameter is the one most often tuned. By default, it is set to the number of Mb of physical memory or 1024, whichever is lower. It cannot be set higher than 2048.
Several kernel parameters are set when maxusers is set unless otherwise overridden by the /etc/system file. Some of these formulas differ between different versions of Solaris:
• max_nprocs: Number of processes = 10 + (16 x maxusers)
• ufs_ninode: Inode cache size = (17xmaxusers)+90 (Solaris 2.5.1) or 4x(maxusers + max_nprocs)+320 (Solaris 2.6-8). See the Disk I/O page for more information.
• ncsize: Name lookup cache size = (17xmaxusers)+90 (Solaris 2.5.1) or 4x(maxusers + max_nprocs)+320 (Solaris 2.6-8). See the Disk I/O page for more information.
• ndquot: Quota table size = (maxusers x 10) + max_nprocs
• maxuproc: User process limit = max_nprocs - 5
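As a hedged worked example of the Solaris 2.6-8 formulas: on a system with 1024 MB of RAM, maxusers defaults to 1024, so max_nprocs = 10 + (16 x 1024) = 16394, ufs_ninode = ncsize = 4 x (1024 + 16394) + 320 = 69992, and maxuproc = 16394 - 5 = 16389.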
ptys
Solaris 8 dynamically sizes the number of ptys available to a system, so you are less likely to run into pty starvation than was the case under Solaris 2.5.1-7. There are still hard system limits that are set based upon hardware configuration, and it may be necessary to increase the number of ptys manually as in Solaris 2.5.1-7.
If the system is suffering from pty starvation, the number of ptys available can be increased by increasing pt_cnt above the default of 48. Solaris 2.5.1 and 2.6 systems should not have pt_cnt set higher than 3844 due to limitations with the telnet and rlogin daemons. Solaris 7 does not have this restriction, but there may be other system issues that prevent setting pt_cnt arbitrarily high. Once pt_cnt is increased, a reconfiguration boot (boot -r) is required to build the ptys.
If pt_cnt is increased, some sources recommend that other variables be set at the same time. Other sources (such as the Solaris2 FAQ) suggest that this advice is spurious and results in a needless consumption of resources. See the notes below before making any of these changes; setting the values too high may result in wasted memory. In any case, one form of these recommendations is:
• npty: Set to pt_cnt (see the note below)
• nautopush: Set to twice the value of pt_cnt
• sadcnt: Set to same value as pt_cnt

npty limits the number of BSD ptys. These are not usually used by applications, but may need to be increased on a system running a special service. In addition to setting npty in the /etc/system file, the /etc/iu.ap file will need to be edited to substitute the value npty-1 in the third field of the ptsl line. After both changes are made, a boot -r is required for the changes to take effect. Note that Solaris does not support any more than 176 BSD ptys in any case.
sadcnt sets the number of STREAMS addressable devices and nautopush sets the number of STREAMS autopush entries. nautopush should be set to twice sadcnt. Whether or not these values need to be increased as above depends on the types of activity on the system.
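A hedged /etc/system sketch of one form of these recommendations (the values are illustrative only; see the caveats above before raising any of them):

set pt_cnt=128
set npty=128
set sadcnt=128
set nautopush=256

A reconfiguration boot (boot -r) is required after making these changes.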
RAM Tuneables
See the Memory/Swapping page for a discussion of parameters related to RAM and paging.
Disk I/O Tuneables
See the Disk I/O page for a full discussion of disk I/O-related tuneables.
File Descriptors
See the File Descriptors page for more discussion regarding tuning issues.
File descriptors are retired when the file is closed or the process terminates. Opens always choose the lowest-numbered file descriptor available. Available file descriptors are allocated as follows:
• rlim_fd_cur: It is dangerous to set this value higher than 256 due to limitations with the stdio library. If programs require more file descriptors, they should use setrlimit directly.
• rlim_fd_max: It is dangerous to set this value higher than 1024 due to limitations with select. If programs require more file descriptors, they should use setrlimit directly.
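Where the limits do need to be raised, a minimal /etc/system sketch that stays within the stdio and select caveats above would be:

set rlim_fd_cur=256
set rlim_fd_max=1024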
Misc Tuneables
• dump_cnt: Size of dumps.
• rstchown: Posix/restricted chown enabled (default=1)
• ngroups_max: Maximum number of supplementary groups per user (default=32).


CPU Loading
A general rule of thumb is that load averages that are persistently above 4 times the number of CPUs will result in sluggish performance. The load averages can be monitored intermittently via uptime or over extended time periods via sar -u.
One issue to watch for is the number of processes that are blocked while waiting for I/O. Check the disk I/O page for information on monitoring this.
For non-NFS servers, another danger sign is when the system consistently spends more time in sys than usr mode. (nfsd operates in the kernel in sys mode.)
Another issue to watch for is a high number of system calls per second per processor. With today's faster CPUs, 20,000 would represent a reasonable threshold. This can be monitored via sar -c. In particular, large numbers of forks or execs may represent excessive context switching. (Slower processors will be able to handle fewer system calls per second.) Context switching is monitored by vmstat or mpstat.

Disk I/O
The primary tool to use in troubleshooting disk I/O problems is iostat.
In particular, use iostat -xn 30 during busy times to look at the I/O characteristics of your devices. Ignore the first report (it contains summary statistics since boot) and look at the subsequent output, which appears every 30 seconds. If you are seeing svc_t (service time) values of more than 30 ms on disks that are in use (more than, say, 10% busy), then the end user will see noticeably sluggish performance.
If a disk is more than 60% busy over sustained periods of time, this can also indicate overuse of that resource.
If iostat consistently reports %w > 5, the disk subsystem is too busy. In this case, one thing that can be done is to reduce the size of the wait queue by setting sd_max_throttle to 64. (This is obviously a temporary solution, and one of the permanent remedies below needs to be implemented.) Another possible cause is SCSI starvation where low SCSI ID devices receive a lower precedence than a higher-numbered device (such as a tape drive). (See the System Bus/SCSI page for more information.)
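A hedged /etc/system entry for the temporary sd_max_throttle workaround described above:

set sd:sd_max_throttle=64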
Another indication of trouble with the disk I/O subsystem is when the procs/b section of vmstat persistently reports a number of blocked processes that is comparable to the run queue (procs/r). (The run queue is roughly comparable to the load average.)

Disk I/O can be investigated to find out whether it is primarily random or sequential. If sar -d reports that (blks/s)/(r+w/s) < 16 KB (~32 blocks), the I/O is predominantly random. If the ratio is > 128 KB (~256 blocks), it is predominantly sequential. This analysis may be useful when examining alternative disk configurations.
The usual solutions to a disk I/O problem are:
• Check filesystem kernel tuning parameters to make sure that DNLC and inode caches are working appropriately. (See "Filesystem Caching" below.)
• Spread out the I/O traffic across more disks by either striping the filesystem (using DiskSuite or VxVM) or by splitting up the data across additional filesystems on other disks, or even across other servers. (In extreme cases, you can even consider striping data over only the outermost cylinders of several otherwise empty disk drives in order to maximize throughput.) Cockroft recommends 128KB as a good stripe width for most applications.
• Redesign the problematic process to reduce the number of disk I/Os. (Caching is one frequently-used strategy, either via cachefs or application-specific caching.)
• The write throttle can be adjusted to provide better performance if there are large amounts of sequential write activity. The parameters in question are ufs:ufs_HW and ufs:ufs_LW. These are very sensitive and should not be adjusted too far at one time. When ufs_WRITES is set to 1 (default), the write throttle is enabled. When the number of outstanding writes exceeds ufs_HW, writes are suspended until the number of outstanding writes drops below ufs_LW. Both can be increased where large amounts of sequential writes are occurring.
• tune_t_fsflushr sets the number of seconds between runs of fsflush; autoup dictates how frequently each bit of memory is checked. Setting fsflush to run less frequently can also reduce disk activity, but it increases the risk of losing data that has only been written to memory. These parameters can be adjusted with adb while searching for an optimum value, and then set permanently in the /etc/system file.
• Check for SCSI starvation, i.e., for busy high-numbered SCSI devices (such as tape drives) that have a higher priority than lower-numbered devices.
• Database I/O should be done to raw disk partitions or direct unbuffered I/O.
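As a hedged illustration of the write-throttle and fsflush adjustments described above (all values are illustrative starting points, not recommendations):

set ufs:ufs_HW=1048576
set ufs:ufs_LW=524288
set tune_t_fsflushr=5
set autoup=240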






Filesystem Caching
There are several types of cache used by the Solaris filesystems to cache name and attribute lookups. These are:
• DNLC (Directory Name Lookup Cache): This cache stores the directory lookup information for files whose names are sufficiently short (30 characters or less), preventing the need to perform directory lookups on the fly. (Solaris 7 and 8 have removed the name length restriction.)
• inode cache: This cache stores attribute information about files in memory (size, access time, etc). It is a linked list that stores the inodes and pointers to all pages that are part of that file and are currently in memory.
• rnode cache: This is similar to the inode cache, but is maintained on NFS clients to store information about NFS-mounted files.
• buffer cache: The buffer cache stores inode, indirect block and cylinder group-related disk I/O.
(Note that cache statistics will be skewed by things that walk the directory tree like find.)
Directory Name Lookup Cache
The DNLC stores directory lookup information for files whose names are shorter than 30 characters. (The restriction on file name length was lifted in Solaris 7 and 8.)
sar -a reports on the activity of this cache. In this output, namei/s reports the name lookup rate and iget/s reports the number of directory lookups per second. Note that an iget is issued for each component of a file's path, so the hit rate cannot be calculated directly from the sar -a output. The sar -a output is useful, however, when looking at cache efficiency in a more holistic sense.
For our purposes, the most important number is the total name lookups line in the vmstat -s output. This line reports a cache hit percentage. If this percentage is not above 90%, the DNLC should be resized.
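A quick hedged way to pull out that line (the exact wording varies slightly between releases, but it includes the cache hit percentage):

vmstat -s | grep 'total name lookups'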
DNLC size is determined by the ncsize kernel parameter. By default, this is set to (17xmaxusers)+90 (Solaris 2.5.1) or 4x(maxusers + max_nprocs)+320 (Solaris 2.6-8). It is not recommended that it be set any higher than a value which corresponds to a maxusers value of 2048.
(Note that the AnswerBooks and Cockroft report the incorrect algorithm for ncsize and ufs_ninode. The above formula comes from Sun's kernel support group.)

To set ncsize, add a line to the /etc/system as follows:
set ncsize=10000
The DNLC can be disabled by setting ncsize to a negative number (Solaris 2.5.1-7) or a non-positive number (Solaris 8).
Inode Cache
The inode cache is a linked list that stores the inodes that have been accessed along with pointers to all pages that are part of that file and are currently in memory.
sar -g reports %ufs_ipg, which is the percentage of inodes that were overwritten while still having active pages in memory. If this number is consistently nonzero, the inode cache should be increased. It is usually the case that this number (ufs_ninode) is set to the same value as ncsize. Like ncsize, ufs_ninode is set to (17xmaxusers)+90 (Solaris 2.5.1) or 4x(maxusers + max_nprocs)+320 (Solaris 2.6-8) unless otherwise specified in the /etc/system file. As with ncsize, it is not recommended that ufs_ninode be set any higher than a value which corresponds to a maxusers value of 2048.
The vmstat -s command also contains summary information about the inode cache in the inode_cache section. Among other things, this section includes sizing and hit rate information.
(The inode cache can grow beyond the ufs_ninode limit. When this happens, unused inodes will be flushed from the linked list.)
netstat -k also reports on inode cache statistics.
While resizing the inode cache, it is important to remember that each inode will use about 300 bytes of kernel memory. Check your kernel memory size (perhaps with sar -k) when resizing the cache. Since ufs_ninode is just a limit, it can be resized on the fly with adb.
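A hedged adb sketch of such an on-the-fly resize (the value 20000 is purely illustrative; the -w flag is required for write access):

# adb -kw /dev/ksyms /dev/mem
ufs_ninode/W 0t20000
^D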
Rnode Cache
The information in the rnode cache is similar to that from the inode cache, except that it is maintained for NFS-mounted files. The default rnode cache size is 2xncsize, which is usually sufficient. Rnode cache statistics can be examined in the rnode_cache section of netstat -k.



Buffer Cache
The buffer cache is used to store inode, indirect block and cylinder group-related disk I/O. The hit rate on this cache can be discovered by examining the biostat section of the output from netstat -k and comparing the buffer cache hits to the buffer cache lookups. This cache acts as a buffer between the inode cache and the physical disk devices.
Sun suggests tuning bufhwm in the /etc/system file if sar -b reports less than 90% hit rate on reads or 65% on writes.
Cockroft notes that performance problems can result from allowing the buffer cache to grow too large, resulting in kernel memory allocation starvation. The default setting for bufhwm allows the buffer to consume up to 2% of system memory, which may be excessive. The buffer cache can probably be limited to a few Mb safely by setting bufhwm in the /etc/system file:
set bufhwm=8000
Obviously, the effects of such a change should be examined by checking the buffer cache hit rate with netstat -k. Buffer cache statistics are also reported by sar -b.
Physical Disk Layout
The disk layout for a hard drive includes the following:
• bootblock
• superblock: Superblock contents can be reported via the fstyp -v /dev/dsk/* command.
• inode list: The number of inodes for a filesystem is calculated based upon a presumption of an average file size of ~2 KB. If this is not a good assumption, the number of inodes can be set via the newfs -i or mkfs command.
• data blocks
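A hedged example of the newfs -i adjustment mentioned above, creating a filesystem with one inode per 8 KB of data space (device and value are illustrative):

newfs -i 8192 /dev/rdsk/c0t1d0s6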






Inodes
Each inode contains the following information:
• file type, permissions, etc
• number of hard links to the file
• UID
• GID
• byte size
• array of block addresses:
The first several block addresses are used for data storage. Other block addresses store indirect blocks, which point at arrays containing pointers to further data blocks. Each inode contains 12 direct block pointers and 3 indirect block pointers.
• generation number (incremented each time the inode is re-used)
• access time
• modification time
• change time
• Number of sectors: This is kept to allow support for holey files, and can be reported via ls -s
• Shadow inode location: This is used for ACLs (access control lists).
Using the indirection provided in the array of block addresses, files can be created that contain holes, or large sets of null-filled bytes.
Physical I/O
Disk I/Os include the following components:
• I/O bus access: If the bus is busy, the request is queued by the driver. This queuing is reported in the avwait column of sar -d and the wait and %w columns of iostat -x.
• Bus transfer time: Arbitration time (which device gets to use the bus--see the System Bus/SCSI page), time to transfer the command (usually ~ 1.5 ms), data transfer time (in the case of a write).
• Seek time: Time for the head to move to the proper cylinder. Average seek times are reported by hard drive manufacturers.
• Rotation time: Time for the correct sector to rotate under the head. This is usually calculated as 1/2 the time for a disk rotation. Rotation speeds (in RPM) are reported by hard drive manufacturers.
• ITR time: Internal Throughput Rate. This is the amount of time required for a transfer between the hard drive's cache and the device media. The ITR time is the limiting factor for sequential I/O, and is reported by the hard drive manufacturer.
• Reconnection time: After the data has been moved to/from the hard drive's internal cache, a connection with the host adapter must be completed. This is similar to the arbitration/ command transfer time discussed above.

• Interrupt time: Time for the completion interrupt to be processed. This is very hard to measure, but high interrupt rates on the CPUs associated with this system board may be an indication of problems.
The disk's ITR rating and internal cache size can be critical when tuning maxcontig (maximum contiguous I/O size). Note: maxphys and maxcontig must be tuned at the same time. The unit of measurement for maxphys is bytes; maxcontig is in blocks.
maxcontig can be changed via the mkfs, newfs or tunefs commands.
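A hedged sketch of adjusting the two together (device and values are illustrative, not recommendations):

set maxphys=1048576    (in /etc/system, in bytes)
tunefs -a 128 /dev/rdsk/c0t1d0s6    (maxcontig, in blocks)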
Direct I/O
Large sequential I/O can cause performance problems due to excessive use of the memory page cache. One way to avoid this problem is to use direct I/O on filesystems where large sequential I/Os are common.
Direct I/O is a mechanism for bypassing the memory page cache altogether. It is enforced by the directio() function or by the forcedirectio option to mount.
VxFS enables direct I/O for large sequential operations. It determines which operations are "large" by comparing them to the vxtunefs parameter discovered_direct_iosz (default 256KB).
One problem that can emerge is that if large sequential I/Os are handed to VxFS as several smaller operations, caching will still occur. This problem can be alleviated by reducing discovered_direct_iosz to a level that prevents caching of the smaller operations. In particular, this can be a problem in OLTP environments. A case study of this problem is discussed on the Sun web site.
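A hedged example of forcing direct I/O on a ufs filesystem at mount time (device and mount point are illustrative; ufs direct I/O requires Solaris 2.6 or later):

mount -F ufs -o forcedirectio /dev/dsk/c0t1d0s6 /export/data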









Lock Contention
Four types of locking are available on Solaris:
• Mutexes
• Semaphores (counters) (not the same as IPC semaphores)
• Condition variables (generalized semaphores)
• Multiple-reader, single-writer locks
The following types of locking problems can occur:
• Lock contention (due to excessively coarse granularity or inappropriate lock type)
• Deadlock (each process is waiting for a lock held by another process)
• Lost locks
• Race conditions
• Incomplete or buggy lock implementation
Mutex Locks
A "mutex lock" is a "mutual exclusion lock." It is created by the LDSTUB (load-store-unsigned-byte) instruction, which is an atomic (indivisible) operation that reads a byte from memory and writes 0xFF into that location. (When the lock is cleared, 0x00 is written back to the memory location.)
If the value that was read from memory is already 0xFF, another processor has already set the lock. At that point, the processor can "spin" by sitting in a loop and testing to see if the lock has cleared (i.e., been written back to 0x00). This sort of "spin lock" is usually used when the wait time for the lock is expected to be short. (If the wait is expected to be longer, the process should sleep so that the CPU can be used by another process. This is known as a "block.")
Adaptive Locks
Solaris 2.x provides a type of locking known as adaptive locks. When one thread attempts to acquire one of these that is held by another thread, it checks to see if the second thread is active on a processor. If it is, the first thread spins. If the second thread is blocked, the first thread blocks as well.
Read/Write Locks
This type of lock allows multiple concurrent reads, but prevents other accesses of the resource when writes are taking place.
Lock Contention Indicators
One indicator of a possible lock contention problem is when vmstat reports that the system is not idle, but that cpu/sy dominates cpu/us. (Note: this observation is only true if the system is not running an NFS server or other major service that runs from inside the kernel.)
One way to pin down a lock contention problem is by tracing the problem process with truss.
Another way to attempt to track down the problem is with mpstat. The smtx measurement shows the number of times a CPU failed to obtain a mutex immediately. The master CPU (the one taking the clock interrupt--usually CPU 0) will tend to have a high reading. Depending upon CPU speed, a reading of more than 500 may be an indication of a system in trouble. If the smtx is greater than 500 on a single CPU and sys dominates usr (ie, system time is larger than user time, and system time is greater than 20%), it is likely that mutex contention is occurring.
Similarly, the mpstat srw value reports the number of times that a CPU failed to obtain a read/write lock immediately.
For Solaris 2.6 and above, the lockstat command can help to pin down the culprit. The kernel takes a performance hit while lockstat is running, so you probably only want to use this command while you are actually looking at the output.
With lockstat, look for large counts (indv), especially with long locking times (nsec).
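A hedged example of gathering statistics over a fixed interval (the sleep command simply bounds the sampling period; any workload command can be substituted):

lockstat sleep 30

lockstat collects kernel lock statistics while the child command runs and prints its report when the command exits.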
In any case, extreme mutex contention problems should be reported to Sun. Changes have been implemented in current versions of the SunOS 5.x kernel that dramatically increase the scalability of the operating system over multiple processors. Unless additional issues are brought to the vendor's attention, they cannot be expected to correct them in future releases.







Memory and swapping
Two indicators of a RAM shortage are the scan rate and swap device activity.
In both cases, the high activity rate can be due to a process that does not have a consistent large impact on performance. The processes running on the system have to be examined to see how frequently they are run and what their impact is. It may be possible to re-work the program or run the process differently to reduce the amount of new data being read into memory. See "Process Memory Usage" below.
Whether or not to provide additional RAM for infrequent processes is a classic money/performance tradeoff. If the cost is more important than the performance, additional virtual memory space must be provided to allow enough space for the application to run. The cheapest way to do this is to provide additional swap space. If adequate total virtual memory space is not provided, new processes will not be able to open. (The system may report "Not enough space" or "WARNING: /tmp: File system full, swap space limit exceeded.")
If inadequate physical memory is provided, the system will be so busy paging to swap that it will be unable to keep up with demand. (This state is known as "thrashing" and is characterized by heavy I/O on the swap device and horrendous performance. In this state, the scanner can use up to 80% of CPU.) (For a more thorough discussion of paging, see "Paging" below.)
Scan Rate
The page scanning rate is the main tipoff that a system does not have enough physical memory. Use sar -g or vmstat to look at the scan rate.
With vmstat, use vmstat 30 to check memory usage every 30 seconds. Ignore the summary statistics on the first line. If page/sr exceeds 200 pages per second for an extended time, your system may be running short of physical memory. (Shorter sampling periods may be used to get a feel for what is happening on a smaller time scale.)
A very low scan rate is a sure indicator that the system is not running short of physical memory. On the other hand, a high scan rate can be caused by transient issues, such as a process reading large amounts of uncached data. The processes on the system should be examined to see how much of a long-term impact they have on performance.
A nonzero scan rate is not necessarily an indication of a problem. Over time, memory is allocated for caching and other activities. Eventually, the amount of memory will reach the lotsfree memory level, and the pageout scanner will be invoked. For a more thorough discussion of the paging algorithm, see Paging below.

Swap Device Activity
The amount of disk activity on the swap device can be measured using iostat. For Solaris 2.6 and higher, iostat -xPnce provides information on disk activity on a partition-by-partition basis. For Solaris 2.5.1, iostat -xc provides information on a disk-by-disk basis, which may be of limited use unless swap has its own physical disk. sar -d provides similar information, and vmstat provides some usage information as well.
If there are I/Os queued for the swap device, application paging is occurring. If there is significant, heavy I/O to the swap device, a RAM upgrade may be in order.
Process Memory Usage
The /usr/proc/bin/pmap command is available in Solaris 2.6 and above. It can help pin down which process is the memory hog. /usr/proc/bin/pmap -x PID prints out details of memory use by a process.
Summary statistics regarding process size can be found in the RSS column of ps -ly or top.
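A hedged sketch for locating the largest processes and then examining one of them (this assumes the POSIX -o option to ps is available; the PID is illustrative):

ps -e -o rss,pid,comm | sort -n | tail -10
/usr/proc/bin/pmap -x 1234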
dbx, the debugging utility in the SunPro package, has extensive memory leak detection built in. The source code will need to be compiled with the -g flag by the appropriate SunPro compiler.
ipcs -mb shows memory statistics for shared memory. This may be useful when attempting to size memory to fit expected traffic.
Segmentation Violations
A "segmentation violation fault" results when a process overflows its stack. The kernel recognizes the violation and can extend the stack size, up to a configurable limit.
In a multithreaded environment, the kernel does not keep track of each user thread's stack, so it cannot perform this function. The thread itself is responsible for stack SIGSEGV (stack overflow signal) handling. (The SIGSEGV signal is sent by the threads library when an attempt is made to write to a write-protected page just beyond the end of the stack. This page is allocated as part of the stack creation request.)


Swap Space
The Solaris virtual memory system combines physical memory with available swap space via swapfs. If insufficient total virtual memory space is provided, new processes will be unable to open.
Swap space can be added, deleted or examined with the swap command. swap -l reports total and free space for each of the swap partitions or files that are available to the system. Note that this number does not reflect total available virtual memory space, since physical memory is not reflected in the output. swap -s reports the total available amount of virtual memory, as does sar -r.
If swap is mounted on /tmp via tmpfs, df -k /tmp will report on total available virtual memory space, both swap and physical.
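A hedged example of adding a 512 MB swap file (path and size are illustrative):

mkfile 512m /export/swapfile
swap -a /export/swapfile

A corresponding entry in /etc/vfstab makes the addition permanent across reboots.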
Paging
Solaris uses both common types of paging in its virtual memory system. These types are swapping (swaps out all memory associated with a user process) and demand paging (swaps out the not recently used pages). Which method is used is determined by comparing the amount of available memory with several key parameters:
• physmem: physmem is the total page count of physical memory.
• lotsfree: The page scanner is woken up when available memory falls below lotsfree. The default value for this is physmem/64; it can be tuned in the /etc/system file if necessary. The page scanner runs in demand paging mode by default. The initial scan rate is set by the kernel parameter slowscan, which is fastscan/10 by default.
• minfree: Between lotsfree and minfree, the scan rate increases linearly between slowscan and fastscan. (minfree is set to desfree/2 and fastscan is set to physmem/4 by default.) If free memory falls below desfree (lotsfree/2 by default), the page scanner is started 100 times per second. Each invocation of the page scanner will run through desscan pages; this parameter is dynamically set based on the scan rate.
• maxpgio: maxpgio (default 40 or 60) limits the rate at which I/O is queued to the swap devices. It is set to 40 for sun4c, sun4m and sun4u architectures and 60 for sun4d architectures. If the disks are faster than 7200rpm, maxpgio can safely be set to 100 times the number of swap disks.
• throttlefree: When free memory falls below throttlefree (default minfree), the page_create routines force the calling process to wait until free pages are available.


• cachefree: If the kernel parameter priority_paging is set to 1 on a Solaris 7 system (or current patchlevels of 2.5.1 or 2.6), only data files will be targeted by the page daemon until lotsfree is reached. By default, cachefree is set to 2 x lotsfree. (Solaris 8 uses a different algorithm to determine which pages are targeted by the page daemon. priority_paging should not be set on a Solaris 8 machine.)
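A hedged /etc/system entry for enabling priority paging on patched Solaris 2.5.1/2.6 or on Solaris 7 (as noted above, it should not be set on Solaris 8):

set priority_paging=1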
The page scanner operates by first clearing a usage flag on each page at a rate reported as "scan rate" in vmstat and sar -g. After handspreadpages additional pages have been examined, the page scanner checks to see whether the usage flag has been reset. If not, the page is swapped out. (The default for handspreadpages is physmem/4.)
Solaris 8 Paging: Solaris 8 uses a different algorithm for removing pages from memory. This new architecture is known as the cyclical page cache. It is designed to remove most of the file system cache-induced problems with virtual memory. The new system fills the same need as priority paging does for Solaris 2.5.1-7.
The cyclical page cache uses a file system free list to cache filesystem data only. Other memory objects are managed on a separate free list. (This second list would include application binaries, shared libraries, applications and uninitialized application data.)
With the new algorithm, filesystem cache only competes with itself for memory. It does not force applications out of primary memory as sometimes happened with the earlier OS versions.
As a result of these changes, vmstat under Solaris 8 will report different statistics than would be expected under an earlier version of Solaris:
• Page Reclaim rate higher.
• Higher reported Free Memory: A large component of the filesystem cache is reported as free memory.
• Low Scan Rates: Scan rates will be near zero unless there is a systemwide shortage of available memory.
vmstat -p reports paging activity details for applications (executables), data (anonymous) and filesystem activity.


Swapping
If the system is consistently below desfree of free memory (over a 30 second average), the memory scheduler will start to swap out processes. (ie, if both avefree and avefree30 are less than desfree, the swapper begins to look at processes.)
Initially, the scheduler will look for processes that have been idle for maxslp seconds. (maxslp defaults to 20 seconds and can be tuned in /etc/system.) This swapping mode is known as soft swapping.
Swapping priorities are calculated for an LWP by the following formula:
epri = swapin_time - rss/(maxpgio/2) - pri
where swapin_time is the time since the thread was last swapped, rss is the amount of memory used by the LWPs process, and pri is the thread's priority.
If, in addition to being below desfree of free memory, there are two processes in the run queue and paging activity exceeds maxpgio, the system will commence hard swapping. In this state, the kernel unloads all modules and cache memory that is not currently active and starts swapping out processes sequentially until desfree of free memory is available.
Processes are not eligible for swapping if they are:
• In the SYS or RT scheduling class.
• Being executed or stopped by a signal.
• Exiting.
• Zombie.
• A system thread.
• Blocking a higher priority thread.


NFS Troubleshooting
Sun's web pages contain substantial information about NFS services; search for an NFS Administration Guide or NFS Server Performance and Tuning Guide for the version of Solaris you are running. The share_nfs man page contains specific information about export options.
If NFS is not working at all, try the following:
• Make sure that the NFS server daemons are running. In particular, check for statd, lockd, nfsd and rarpd. If the daemons are not running, they can be started by running /etc/init.d/nfs.server start. See Daemons below for information on NFS-related daemons.
• Check the /etc/dfs/dfstab and type shareall.
• Use share or showmount -e to see which filesystems are currently exported, and to whom. showmount -a shows who the server believes is actually mounting which filesystems.
• Make sure that your name service is translating the server and client hostnames correctly on both ends. Check the server logs to see if there are messages regarding failed or rejected mount attempts; check to make sure that the hostnames are correct in these messages.
• Make sure that the /etc/net/*/hosts files on both ends report the correct hostnames. Reboot if these have to be edited.
If you are dealing with a performance issue, check
• Network Issues
• CPU Usage
• Memory Levels
• Disk I/O
• Increase the number of nfsd threads in /etc/init.d/nfs.server if the problem is that requests are waiting for a turn. Note that this does increase memory usage by the kernel, so make sure that there is enough RAM in the server to handle the additional load.
• Where possible, mount filesystems with the ro option to prevent additional, unnecessary attribute traffic.
• If attribute caching does not make sense (for example, with a mail spool), mount the filesystem with the noac option. If nfsstat reports a high getattr level, actimeo may need to be increased (if the attributes do not change too often).
• nfsstat reports on most NFS-related statistics. The nfsstat page includes information on tuning suggestions for different types of problems that can be revealed with nfsstat.
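A hedged example of the noac case above, mounting a mail spool without attribute caching (server and paths are illustrative):

mount -F nfs -o noac mailhost:/var/mail /var/mail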
If these steps do not resolve the issue, structural changes may be required:
• cachefs can be used to push some of the load from the NFS server onto the NFS clients. To be useful, cfsadmin should be used to increase maxfilesize for the cache to a value high enough to allow for the caching of commonly-used files. (The default value is 3 Mb.)
NFS Client
When a client makes a request to the NFS server, a file handle is returned. The file handle is a 32 byte structure which is interpreted by the NFS server. Commonly, the file handle includes a file system ID, inode number and the generation number of the inode. (The latter can be used to return a "stale file handle" error message if the inode has been freed and re-used between client file accesses.)
If a response is not received for a request, it is resent, but with an incremented xid (transmission ID). This can happen because of congestion on the network or the server, and can be observed with a snoop session between server and client.
The server handles retransmissions differently depending on whether the requests are idempotent (can be executed several times without ill effect) or nonidempotent (cannot be executed several times). Examples of these would include things like reads and getattrs versus writes, creates and removes. The system maintains a cache of nonidempotent requests so that appropriate replies can be returned.
Daemons
The following daemons play a critical role in NFS service:
• biod: On the client end, handles asynchronous I/O for blocks of NFS files.
• nfsd: Listens and responds to client NFS requests.
• mountd: Handles mount requests.
• lockd: Network lock manager.
• statd: Network status manager.
Process Accounting
Process accounting can be turned on by referencing the /etc/init.d/acct file, either in a command line with start/stop, or by linking to the appropriate rc file:

ln -s /etc/init.d/acct /etc/rc2.d/S22acct

ln -s /etc/init.d/acct /etc/rc0.d/K22acct
The /etc/acct/holidays file may also need to be edited, if different monitoring is desired during holidays.
The root and adm crontabs can also be edited to run dodisk, ckpacct, monacct, and runacct.
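The crontab entries typically look something like the following, adapted from Sun's accounting documentation (the times are illustrative). In the adm crontab:

0 * * * * /usr/lib/acct/ckpacct
30 2 * * * /usr/lib/acct/runacct 2> /var/adm/acct/nite/fd2log
30 7 1 * * /usr/lib/acct/monacct

and in the root crontab:

30 22 * * 4 /usr/lib/acct/dodisk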
Process accounting logs can be examined directly using acctcom.



Network Debugging
ifconfig
ifconfig is the primary command to use for debugging network interface problems, especially ifconfig -a. If necessary, ifconfig -a statements can be inserted into rc scripts to track interface condition during the boot process.
The first thing to check is that all values from ifconfig -a are as expected (FLAGS=UP and RUNNING, MTU=1500 for ethernet, INET=IP address, NETMASK correct (255.255.252.0 for Princeton), BROADCAST correct, ETHER=ethernet address).
netstat
netstat provides useful information regarding traffic flow. In particular, netstat -i lists statistics for each interface, netstat -s provides a full listing of several counters, and netstat -rs provides routing table statistics. netstat -k provides a useful summary of several network-related statistics, but this option is officially unsupported and may be removed in a future release.
Here are some of the issues that can be revealed with netstat:
• netstat -i: (Collis+Ierrs+Oerrs)/(Ipkts+Opkts) > 2%: This may indicate a network hardware issue.
• netstat -i: (Collis/Opkts) > 10%: The interface is overloaded. Traffic will need to be reduced or redistributed to other interfaces or servers.
• netstat -i: (Ierrs/Ipkts) > 25%: Packets are probably being dropped by the host, indicating an overloaded network (and/or server). Retransmissions can be dropped by reducing the rsize and wsize mount parameters to 2048 on the clients. Note that this is a temporary workaround, since this has the net effect of reducing maximum NFS throughput on the segment.
• netstat -s: If significant numbers of packets arrive with bad headers, bad data length or bad checksums, check the network hardware.
• netstat -i: If there are more than 120 collisions/second, the network is overloaded. See the suggestions above.
• netstat -i: If the sum of input and output packets is higher than about 600/second for a 10 Mbps interface or 6000/second for a 100 Mbps interface, the network segment is too busy. See the suggestions above.
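A hedged one-liner for computing these ratios from netstat -i (it assumes the standard column order of Name, Mtu, Net/Dest, Address, Ipkts, Ierrs, Opkts, Oerrs, Collis, Queue):

netstat -i | awk 'NR>1 && $5>0 && $7>0 { printf "%s: collis/opkts=%.1f%%  ierrs/ipkts=%.1f%%\n", $1, 100*$9/$7, 100*$6/$5 }'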
snoop
snoop provides a snapshot of network traffic. This utility gives a definitive answer to the question of whether packets are arriving at their destination.
ping and traceroute
ping -sRv (or traceroute, if available) can provide useful routing information that may pinpoint the source of network congestion.





System Configuration Files

For details about the files and commands summarized here, consult the appropriate man pages or http://docs.sun.com/
File Description
/etc/bootparams Contains information regarding network boot clients.
/etc/cron.d/cron.allow
/etc/cron.d/cron.deny Allow access to crontab for users listed in this file. If the file does not exist, access is permitted for users not in the /etc/cron.d/cron.deny file.
/etc/defaultdomain NIS domain set by /etc/init.d/inetinit
/etc/default/cron Sets cron logging with the CRONLOG variable.
/etc/default/login Controls root logins via specification of the CONSOLE variable, as well as variables for login logging thresholds and password requirements.
/etc/default/su Determines logging activity for su attempts via the SULOG and SYSLOG variables, sets some initial environment variables for su sessions.
/etc/dfs/dfstab Determines which directories will be NFS-shared at boot time. Each line is a share command.
/etc/dfs/sharetab Contains a table of resources that have been shared via share.
/etc/group Provides groupname translation information.
/etc/hostname.interface Assigns a hostname to interface; assigns an IP address by cross-referencing /etc/inet/hosts.
/etc/hosts.allow
/etc/hosts.deny Determine which hosts will be allowed access to TCP wrapper mediated services.
/etc/hosts.equiv Determines which set of hosts will not need to provide passwords when using the "r" remote access commands (eg rlogin, rsh, rexec)
/etc/inet/hosts
/etc/hosts Associates hostnames and IP addresses.
/etc/inet/inetd.conf
/etc/inetd.conf Identifies the services that are started by inetd as well as the manner in which they are started. inetd.conf may even specify that TCP wrappers be used to protect a service.
/etc/inittab inittab is used by init to determine the scripts to run for different run levels as well as the default run level.
/etc/logindevperm Contains information to change permissions for devices upon console logins.
/etc/magic Database of magic numbers that identify file types for file.
/etc/mail/aliases
/etc/aliases Contains mail aliases recognized by sendmail.
/etc/mail/sendmail.cf
/etc/sendmail.cf Mail configuration file for sendmail.
/etc/minor_perm Specifies permissions for device files; used by drvconfig
/etc/mnttab Contains information about currently mounted resources.
/etc/name_to_major List of currently configured major device numbers; used by drvconfig.
/etc/netconfig Network configuration database read during network initialization.
/etc/netgroup Defines groups of hosts and/or users.
/etc/netmasks Determines default netmask settings.
/etc/nsswitch.conf Determines order in which different information sources are accessed when performing lookups.
/etc/path_to_inst Contents of physical device tree using physical device names and instance numbers.
/etc/protocols Known protocols.
/etc/remote Attributes for tip sessions.
/etc/rmtab Filesystems currently mounted by remote NFS clients.
/etc/rpc Available RPC programs.
/etc/services Well-known networking services and associated port numbers.
/etc/syslog.conf Configures syslogd logging.
/etc/system Can be used to force kernel module loading or set kernel tuneable parameters.
/etc/vfstab Information for mounting local and remote filesystems.
/var/adm/messages Main log file used by syslogd.
/var/adm/sulog Default log for recording use of su command.
/var/adm/utmpx User and accounting information.
/var/adm/wtmpx User login and accounting information.
/var/local/etc/ftpaccess
/var/local/etc/ftpconversions
/var/local/etc/ftpusers wu-ftpd configuration files to set ftp access rights, conversion/compression types, and a list of userids to exclude from ftp operations.
/var/lp/log Print services activity log.
/var/sadm/install/contents Database of installed software packages.
/var/saf/_log Logs activity of SAF (Service Access Facility).


The /etc/inittab File
The /etc/inittab file plays a crucial role in the boot sequence.
The line entries in the inittab file have the following format:
id:runlevel:action:process
Here the id is a two-character unique identifier, runlevel indicates the run level involved, action indicates how the process is to be run, and process is the command to be executed.
At boot time, all entries with runlevel "sysinit" are run. Once these processes are run, the system moves towards the init level indicated by the "initdefault" line. For a default inittab, the line is:
is:3:initdefault:
(This indicates a default runlevel of 3.)
By default, the first script run from the inittab file is /sbin/bcheckrc, which checks the state of the root and /usr filesystems. The line controlling this script has the following form:
fs::sysinit:/sbin/bcheckrc >/dev/console 2>&1
The inittab also controls what happens at each runlevel. For example, the default entry for runlevel 2 is:
s2:23:wait:/sbin/rc2 >/dev/console 2>&1
The action field of each entry will contain one of the following keywords:
• powerfail: The system has received a "powerfail" signal.
• wait: Wait for the command to be completed before proceeding.
• respawn: Restart the command.


Solaris 2.x Core Dump Analysis
If you are having trouble getting a core dump, see the savecore page.
Several useful pieces of information can be found by running strings on the vmcore.# file and piping it through more or grep: strings vmcore.# | more. In particular, this will tell you the architecture of the system and the OS level. The message buffer is also near the top of the strings output, and may include messages that had not been written to the console or log files yet. (Note that the message buffer is a ring buffer, so the messages may not be in chronological order.)
If the system panicked due to a bad trap, adb can determine the instruction that was running at the time of the crash.
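A hedged example of such an adb session against saved crash dump files (the file numbers are illustrative; $c prints the panicking thread's stack trace and $q exits):

# adb -k unix.0 vmcore.0
$c
$q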
crash can also provide useful information, including lock and kernel memory allocation. crash output is notably less cryptic than adb output.
Some netstat, nfsstat and arp commands are also available for crash dumps. After running the command on the corefiles (eg netstat -d unix.# vmcore.#), compare the output to that on the live system, since some of the options do not report the crash dump statistics.
ipcs can also be used with crash dumps in the following format: ipcs -a -C vmcore.# -N unix.# (see the IPC page for information on these facilities).

Changing a hostname
The following steps are required to change a Sun system's hostname.
• /etc/hosts.allow (to correct access permissions)
• /etc/dfs/dfstab on this system's NFS servers (to allow proper mount access)
• /etc/vfstab on this system's NFS clients (so they will point at the correct server)
• kerberos configurations
• ethers and hosts NIS maps
• DNS information
• Netgroup information
• cron jobs should be reviewed.
• Other hostname-specific scripts and configuration files.
(Additional steps may be required in order to correct issues involving other systems)
Having said all that, the minimum set of files that must be changed is:
• /etc/nodename
• /etc/hosts
• /etc/hostname.*
• /etc/net/*/hosts




Error Message interpretation

The AnswerBook contains an alphabetical listing of common error messages.
Traps and interrupts can be blocked by a kernel thread's signal mask, or they can trigger an exception handling routine. In the absence of such a routine or mask, the process is terminated.
Traps
Traps are synchronous messages generated by the process or its underlying kernel thread. Examples include SIGSEGV, SIGPIPE and SIGSYS. They are delivered to the process that caused the signal.
Trap messages can be discovered in a number of places, including error logs, adb output, and console messages. Sun provides a couple of files that can help determine the type of trap encountered:
• /usr/include/sys/trap.h (software traps)
• /usr/include/v7/sys/machtrap.h (hardware traps, 32 bit)
• /usr/include/v9/sys/machtrap.h (hardware traps, 64 bit)
ECC (Error Checking and Correcting) interrupts are reported as traps when a bit error is corrected. These, while they do not crash the system, are usually a signal that the memory chip in question needs to be replaced.
Critical errors include things like fan/temperature warnings or power loss that require immediate attention and shutdown.
Fatal errors are hardware errors where proper system function cannot be guaranteed. These result in a watchdog reset.
Bus Errors
A bus error is issued to the processor when it references a location that cannot be accessed.
• Illegal address: (usually a software failure)
• Instruction fetch/Data load: (device driver bug)
• DVMA: (on an Sbus system)
• Synchronous/asynchronous data store
• MMU: (Memory Management Unit: can be hardware or software, but frequently indicates a system board problem.)
Interrupts
These notify the CPU of external device conditions that are asynchronous with normal operation. They can be delivered to the responsible process or kernel thread.
In Solaris, interrupts are handled by dedicated interrupt-handling kernel threads, which use mutex locks and semaphores. The kernel will block interrupts in a few exceptional circumstances, such as during the process of acquiring a mutex lock protecting a sleep queue.
• Device done or ready.
• Error detected.
• Power on/off.
Watchdog Reset
Watchdog resets can be caused by hardware or software issues. See the watchdog reset page for information on how to troubleshoot watchdog resets.

Solaris Filesystem Troubleshooting
Filesystem corruption can be detected and often repaired by the format and fsck commands. If the filesystem corruption is not due to an improper system shutdown, the hard drive hardware may need to be replaced.
ufs filesystems contain the following types of blocks:
• boot block: This stores information used to boot the system.
• superblock: Much of the filesystem's internal information is stored in these.
• inode: Stores location information about a file--everything except for the file name. The number of inodes in a filesystem can be changed from the default if newfs -i is used to create the filesystem.
• data block: The file's data is stored in these.
fsck:
The fsck command is run on each filesystem at boot time. This utility checks the internal consistency of the filesystem, and can make simple repairs on its own. More complex repairs require feedback from the root user, either in terms of a "y" keyboard response to queries, or invocation with the -y option.
If fsck cannot determine where a file belongs, the file may be renamed to its inode number and placed in the filesystem's lost+found directory. If a file is missing after a noisy fsck session, it may still be intact in the lost+found directory.
Sometimes the fsck command complains that it cannot find the superblock. Alternative superblock locations were created by newfs at the time that the filesystem was created. The newfs -N command can be invoked to nondestructively discover the superblock locations for the filesystem.
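A hedged example of listing the alternate superblocks nondestructively and then running the check against one of them (the device is illustrative; block 32 is commonly available as a backup superblock):

newfs -N /dev/rdsk/c0t0d0s5
fsck -F ufs -o b=32 /dev/rdsk/c0t0d0s5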
ufs filesystems can carry "state flags" that have the value of fsclean, fsstable, fsactive or fsbad (unknown). These can be used by fsck during boot time to skip past filesystems that are believed to be okay.
format:
The analyze option of format can be used to examine the hard drive for flaws in a nondestructive fashion.
df:
df can be used to check a filesystem's available space. Of particular interest is df -kl, which checks available space for all local filesystems and prints out the statistics in kilobytes.
du:
du can be used to check space used by a directory. In particular, du -dsk will report usage in kilobytes of a directory and its descendants, without including space totals from other filesystems.
Filesystem Tuning
Filesystem performance can be improved by looking at filesystem caching issues.
The following tuning parameters may be valuable in tuning filesystem performance with tunefs or mkfs/newfs:
• inode count: The default is based upon an assumption of average file sizes of 2 KB. This can be set with mkfs/newfs at the time of filesystem creation.
• time/space optimization: Optimization can be set to allow for fastest performance or most efficient space usage.
• minfree: In Solaris 2.6+, this is set to (64 MB / filesystem size) x 100. Filesystems in earlier OS versions reserved 10%. This parameter specifies how much space is to be left empty in order to preserve filesystem performance.
• maxbpg: This is the maximum number of blocks a file can allocate out of a single cylinder group. Increasing this limit can improve large file performance, but may have a negative impact on small file performance.
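As a sketch (the slice name is a placeholder and the values are illustrative rather than recommendations), the last three parameters can be adjusted on an existing filesystem with tunefs, while the inode count can only be set when the filesystem is created:

# tunefs -m 5 /dev/rdsk/c0t0d0s7      (reduce minfree to 5%)
# tunefs -o space /dev/rdsk/c0t0d0s7      (optimize for space rather than time)
# tunefs -e 2048 /dev/rdsk/c0t0d0s7      (raise maxbpg for large-file workloads)
# newfs -i 8192 /dev/rdsk/c0t0d0s7      (at creation time: one inode per 8 KB of data space)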

kmastat
Kernel memory size can be tracked using the sar -k command. The total of the "alloc" fields is the kernel memory size. If it appears to be growing without bound, you may have a memory leak. It should be noted that not all buckets are tracked by sar -k, so the reported memory size is not as accurate as that reported by crash.
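For example, to sample kernel memory allocation every 30 seconds for 10 samples:

# sar -k 30 10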
On occasion there are problems related to memory leaks in the kernel or one of the associated modules. In these cases, kmastat can provide useful information pinpointing the source of the leak.
To check on kernel memory allocations on a running system, run a crash session as follows:

# crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> kmastat
The first number on the "Total" line represents the total amount of memory allocated by the kernel. If this is a significant fraction of available system memory and growing, there is a problem.
The output from the kmastat command also contains information on a number of "buckets" or categories for memory allocation.
Additional information can be obtained via the kmausers command, but this requires booting under kadb so that kmem_flags can be set before the kernel starts. To do this, reach the ok> prompt, then:

ok> boot kadb -d
kadb: (hit the "return" key)
kadb[0]: kmem_flags/W 01
kadb[0]: :c
Loading kadb this way means that kadb will only be effective for this current boot session.
Once the system is up, we can either force a core dump via STOP-A/ ok> sync, or we can examine the live system. In either case, inside the crash session we would type:

>kmausers bucket_name
The result will show memory allocations inside that bucket. The names of functions inside each allocation will be a tip-off to what is actually grabbing the memory allocation.
A script can be run from cron to capture this information. The format of this script would be something like:

#!/bin/sh
# Append a timestamped kernel memory snapshot to log_file, then (after a
# short delay) a per-caller report for the kmem_alloc_2048 bucket.
date >> log_file
echo "kmastat" | /usr/sbin/crash -w log_file
sleep 20
echo "kmausers kmem_alloc_2048" | /usr/sbin/crash -w log_file
Slab Allocator
Solaris 2.4+ uses a kernel memory allocator known as a slab allocator.
A kernel memory allocator performs the following functions:
• Allocate memory
• Initialize objects/structures
• Use objects/structures
• Deconstruct objects/structures
• Free memory
The structures in the memory objects include sub-objects such as linked list headers, mutexes, reference counts and condition variables. In the case of Solaris, the deconstruction step includes setting objects to their initial settings, which can save time when the memory objects have to be re-initialized.
A translation lookaside buffer (TLB) is an associative cache of recent address translations. When the MMU (memory management unit) cannot find a translation in the TLB, it looks it up in the address maps and loads the address into the TLB. Entries in the TLB are replaced on a least recently used basis.
The slab allocator is organized as a collection of object caches. Each of these caches contains only one type of object (proc structures, vnodes, etc). The kernel is responsible for restoring each object to its initial state when it is released. When a cache requires additional space, the allocator gets a slab of memory from the page-level allocator and creates objects from it. The slab contains enough memory for several object instances. A small part of the slab is used by the cache to manage memory in the slab; the rest is divided into buffers that are the size of the object. The allocator then initializes these buffers with the appropriate constructor.
When the page-level allocator needs to recover memory, unused slabs are reaped by deconstructing the objects on slabs whose objects are all free, then removing the slab from the cache in question.
The structure for each slab includes unused space at the beginning of the slab (coloring area), the set of objects, more unused space (the amount left over after the maximum number of objects has been created), and a slab data area. Each object also includes a four byte area for a free list pointer. The slab data area includes a count of in-use objects, pointers for a doubly-linked list of slabs in the same cache, and a pointer to the first free area in the slab. The coloring areas are different sizes for each slab in a cache (where possible). This allows a balanced distribution of traffic on the hardware caches and memory busses by varying the offsets for the different slabs.
Large object slabs are slightly different in that management data is stored in a separate pool of memory, since large slabs are usually multiples of a page in size. A hash table is also maintained to provide lookups between the management area and the slabs.


Kernel Modules
• modinfo prints out information on all currently loaded kernel modules.
• modload loads kernel modules.
• modunload unloads kernel modules.
• forceload in the /etc/system file loads a module at boot time.
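A quick sketch (the module name and id are illustrative only):

# modinfo | grep nfs      (find the id and name of a loaded module)
# modload /kernel/fs/nfs      (load a module by its full path)
# modunload -i 105      (unload the module whose id was reported by modinfo)

A forceload entry in /etc/system looks like:

forceload: fs/nfs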

Solaris process scheduling
In Solaris, highest priorities are scheduled first. Kernel thread scheduling information can be revealed with ps -elcL.
A process can exist in one of the following states: running, sleeping or ready.
The following scheduling classes exist in Solaris:
• Timesharing (TS): Normal user work. The CPU is shared in rotation between threads at the same priority via time slicing. Compute-bound operations have their priority lowered and I/O-bound operations have their priorities raised.
• Interactive (IA): Interactive class. This class is the same as the TS class plus a priority boost that is given to the task in the active window.
• System (SYS): Kernel priorities. This class is used for system threads (eg page daemon). Threads in this class do not share the CPU via time slicing; they run until finished or pre-empted. This class also features fixed priority levels.
• Real Time (RT): Used for processes that require immediate system access, usually critical hardware systems. This class has the highest priority except for interrupt handling. The CPU is shared via time slicing if there are several threads with the same priority. Real time threads have a fixed priority for the duration of their lives.
If the RT scheduling class has not been activated, then TS and IA processes will have priorities between 0 and 59, SYS threads will have priorities of 60-99, and interrupt threads will have priorities between 100 and 109.
If RT has been activated, TS, IA and SYS will be as above. RT will have priorities between 100 and 159, and interrupt threads will have priorities between 160 and 169.
Time Slicing
TS and IA scheduling classes implement an adaptive time slicing scheme that increases the priority of I/O-bound processes at the expense of compute-bound processes. The exact values that are used to implement this can be found in the dispatch table. To examine the TS dispatch table, run the command dispadmin -c TS -g. (If units are not specified, dispadmin reports time values in ms.)
The following values are reported in the dispatch table:
• ts_quantum: This is the default length of time assigned to a process with the specified priority.
• ts_tqexp: This is the new priority that is assigned to a process that uses its entire time quantum.
• ts_slpret: The new priority assigned to a process that blocks before using its entire time quantum.
• ts_maxwait: If a thread does not receive CPU time during a time interval of ts_maxwait, its priority is raised to ts_lwait.
• ts_lwait: The priority assigned to a thread whose priority is raised because it waited ts_maxwait without receiving CPU time.
The man page for ts_dptbl contains additional information about these parameters.
dispadmin can be used to edit the dispatch table to affect the decay of priority for compute-bound processes or the growth in priority for I/O-bound processes. Obviously, the importance of the different types of processing on different systems will make a difference in how these parameters are tweaked. In particular, ts_maxwait and ts_lwait can prevent CPU starvation, and raising ts_tqexp slightly can slow the decline in priority of CPU-bound processes.
In any case, the dispatch tables should only be altered slightly at each step in the tuning process, and should only be altered at all if you have a specific goal in mind.
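A cautious workflow for experimenting with the TS table might look like the following; the saved copy can be reloaded the same way if the change does not work out:

# dispadmin -c TS -g > /tmp/ts.dpt      (save the current dispatch table)
# vi /tmp/ts.dpt      (make one small, targeted change)
# dispadmin -c TS -s /tmp/ts.dpt      (load the edited table)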
The following are some of the sorts of changes that can be made:
• Decreasing ts_quantum favors IA class objects.
• Increasing ts_quantum favors compute-bound objects.
• ts_maxwait and ts_lwait control CPU starvation.
• ts_tqexp can cause compute-bound objects' priorities to decay more or less rapidly.
• ts_slpret can cause I/O-bound objects' priorities to rise more or less rapidly.
RT objects time slice differently in that ts_tqexp and ts_slpret do not increase or decrease the priority of the object. Each RT thread will execute until its time slice is up or it is blocked while waiting for a resource.
IA objects add 10 to the regular TS priority of the process in the active window. This priority shifts with the focus on the active window.
Callouts
Solaris handles callouts with a callout thread that runs at maximum system priority, which is still lower than any RT thread. RT callouts are handled separately and are invoked at the lowest interrupt level, which ensures prompt processing.
Priority Inheritance
Each thread has two priorities: global priority and inherited priority. The inherited priority is normally zero unless the thread is sitting on a resource that is required by a higher priority thread.
When a thread blocks on a resource, it attempts to "will" or pass on its priority to all threads that are directly or indirectly blocking it. The pi_willto() function checks each thread that is blocking the resource or that is blocking a thread in the synchronization chain. When it sees threads at a lower priority, those threads inherit the priority of the blocked thread. It stops traversing the synchronization chain when it hits an object that is not blocked or is at a higher priority than the willing thread.
This mechanism is of limited use when considering condition variables, semaphores or read/write locks. In the latter case, an owner-of-record is defined, and the inheritance works as above. If there are several threads sharing a read lock, however, the inheritance only works on one thread at a time.
Thundering Herd
When a resource is freed, all threads awaiting that resource are woken. This results in a footrace to obtain access to that object; one succeeds and the others return to sleep. This can lead to wasted overhead for context switches, as well as a problem with lower priority threads obtaining access to an object before a higher-priority thread. This is called a "thundering herd" problem.
Priority inheritance is an attempt to deal with this problem, but some types of synchronization do not use inheritance.
Turnstiles
Each synchronization object (lock) contains a pointer to a structure known as a turnstile. These contain the data needed to manipulate the synchronization object, such as a queue of blocked threads and a pointer to the thread that is currently using the resource. Turnstiles are dynamically allocated based on the number of allocated threads on the system. A turnstile is allocated by the first thread that blocks on a resource and is freed when no more threads are blocked on the resource.
Turnstiles queue the blocked threads according to their priority. Turnstiles may issue a signal to wake up the highest-priority thread, or they may issue a broadcast to wake up all sleeping threads.
Adjusting Priorities
The priority of a process can be adjusted with priocntl or nice, and the priority of an LWP can be controlled with priocntl().
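For example (the PID and command name are placeholders):

# priocntl -l      (list the configured scheduling classes and their priority ranges)
# priocntl -s -c TS -p -10 -i pid 4242      (lower the TS user priority of process 4242)
# priocntl -e -c RT -p 0 ./rt_daemon      (start a command in the RT class; requires root)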
Real Time Issues
STREAMS processing is moved into its own kernel threads, which run at a lower priority than RT threads. If an RT thread places a STREAMS request, it may be serviced at a lower priority level than is merited.
Real time processes also lock all their pages in memory. This can cause problems on a system that is underconfigured for the amount of memory that is required.
Since real time processes run at such a high priority, system daemons may suffer if the real time process does not permit them to run.
When a real time process forks, the new process also inherits real time privileges. The programmer must take care to prevent unintended consequences. Loops can also be hard to stop, so the programmer also needs to make sure that the program does not get caught in an infinite loop.
Interrupts
Interrupt levels run between 0 and 15. Some typical interrupts include:
• soft interrupts
• SCSI/FC disks (3)
• Tape, Ethernet
• Video/graphics
• clock() (10)
• serial communications
• real-time CPU clock
• Nonmaskable interrupts (15)


Solaris Processes
The process is one of the fundamental abstractions of Unix. Every object in Unix is represented as either a file or a process. (With the introduction of the /proc structure, there has been an effort to represent even processes as files.)
Processes are usually created with fork or a less resource intensive alternative such as fork1 or vfork. fork duplicates the entire process context, while fork1 only duplicates the context of the calling thread. This can be useful (for example), when exec will be called shortly.
Solaris, like other Unix systems, provides two modes of operation: user mode, and kernel (or system) mode. Kernel mode is a more privileged mode of operation. Processes can be executed in either mode, but user processes usually operate in user mode.
Per-process Virtual Memory
Each process has its own virtual memory space. References to real memory are provided through a process-specific set of address translation maps. The computer's Memory Management Unit (MMU) contains a set of registers that point to the current process's address translation maps. When the current process changes, the MMU must load the translation maps for the new process. This is called a context switch.
The MMU is only addressable in kernel mode, for obvious security reasons.
The kernel text and data structures are mapped in a portion of each process's virtual memory space. This area is called the kernel space (or system space).
In addition, each process contains these two important kernel-owned areas in virtual memory: u area and kernel stack. The u area contains information about the process such as information about open files, identification information and process registers. The kernel stack is provided on a per-process basis to allow the kernel to be re-entrant. (ie, several processes can be involved in the kernel, and may even be executing the same routine concurrently.) Each process's kernel stack keeps track of its function call sequence when executing in the kernel.
The kernel can access the memory maps for non-current processes by using temporary maps.
The kernel can operate in either process context or system (or interrupt) context. In process context, the kernel has access to the process's memory space (including u area and kernel stack). It can also block the current process while waiting for a resource. In kernel context, the kernel cannot access the address space, u area or kernel stack. Kernel context is used for handling certain system-wide issues such as device interrupt handling or process priority computation.
Additional information is available on the Process Virtual Memory page.
Process Context
Each process's context contains information about the process, including the following:
• Hardware context:
o Program counter: address of the next instruction.
o Stack pointer: address of the last element on the stack.
o Processor status word: information about system state, with bits devoted to things like execution modes, interrupt priority levels, overflow bits, carry bits, etc.
o Memory management registers: Mapping of the address translation tables of the process.
o Floating point unit registers.
• User address space: program text, data, user stack, shared memory regions, etc.
• Control information: u area, proc structure, kernel stack, address translation maps.
• Credentials: user and group IDs (real and effective).
• Environment variables: strings of the form variable=value.
During a context switch, the hardware context registers are stored in the Process Control Block in the u area.
The u area includes the following:
• Process control block.
• Pointer to the proc structure.
• Real/effective UID/GID.
• Information regarding current system call.
• Signal handlers.
• Memory management information (text, data, stack sizes).
• Table of open file descriptors.
• Pointers to the current directory vnode and the controlling terminal vnode.
• CPU usage statistics.
• Resource limitations (disk quotas, etc)
The proc structure includes the following:
• Identification: process ID and session ID
• Kernel address map location.
• Current process state.
• Pointers linking the process to a scheduler queue or sleep queue.
• Pointers linking this process to lists of active, free or zombie processes.
• Pointers keeping this structure in a hash queue based on PID.
• Sleep channel (if the process is blocked).
• Scheduling priority.
• Signal handling information.
• Memory management information.
• Flags.
• Information on the relationship of this process and other processes.
Kernel Services
The Solaris kernel may be seen as a bundle of kernel threads. It uses synchronization primitives to prevent priority inversion. These include mutexes, semaphores, condition variables and read/write locks.
The kernel provides service to processes in the following four ways:
• System Calls: The kernel executes requests submitted by processes via system calls. The system call interface invokes a special trap instruction.
• Hardware Exceptions: The kernel notifies a process that attempts several illegal activities such as dividing by zero or overflowing the user stack.
• Hardware Interrupts: Devices use interrupts to notify the kernel of status changes (such as I/O completions).
• Resource Management: The kernel manages resources via special processes such as the pagedaemon.
In addition, some system services (such as NFS service) are contained within the kernel in order to reduce overhead from context switching.
Threads
An application's parallelism is the degree of parallel execution achieved. In the real world, this is limited by the number of processors available in the hardware configuration. Concurrency is the maximum achievable parallelism in a theoretical machine that has an unlimited number of processors. Threads are frequently used to increase an application's concurrency.
A thread represents a relatively independent set of instructions within a program. A thread is a control point within a process. It shares global resources within the context of the process (address space, open files, user credentials, quotas, etc). Threads also have private resources (program counter, stack, register context, etc).
The main benefit of threads (as compared to multiple processes) is that the context switches are much cheaper than those required to change current processes. Sun reports that a fork() takes 30 times as long as an unbound thread creation and 5 times as long as a bound thread creation.
Even within a single-processor environment, multiple threads are advantageous because one thread may be able to progress even though another thread is blocked while waiting for a resource.
Interprocess communication also takes considerably less time for threads than for processes, since global data can be shared instantly.
Kernel Threads
A kernel thread is the entity that is scheduled by the kernel. If no lightweight process is attached, it is also known as a system thread. It uses kernel text and global data, but has its own kernel stack, as well as a data structure to hold scheduling and synchronization information.
Kernel threads store the following in their data structure:
• Copy of the kernel registers.
• Priority and scheduling information.
• Pointers to put the thread on the scheduler or wait queue.
• Pointer to the stack.
• Pointers to associated LWP and proc structures.
• Pointers to maintain queues of threads in a process and threads in the system.
• Information about the associated LWP (as appropriate).
Kernel threads can be independently scheduled on CPUs. Context switching between kernel threads is very fast because memory mappings do not have to be flushed.
Lightweight Processes
A lightweight process can be considered as the swappable portion of a kernel thread.
Another way to look at a lightweight process is to think of it as a "virtual CPU" which performs the processing for applications. Application threads are attached to available lightweight processes, which are attached to a kernel thread, which is scheduled on the system's CPU dispatch queue.
LWPs can make system calls and can block while waiting for resources. All LWPs in a process share a common address space. IPC (interprocess communication) facilities exist for coordinating access to shared resources.
LWPs contain the following information in their data structure:
• Saved values of user-level registers (if the LWP is not active)
• System call arguments, results, error codes.
• Signal handling information.
• Data for resource usage and profiling.
• Virtual time alarms.
• User time/CPU usage.
• Pointer to the associated kernel thread.
• Pointer to the associated proc structure.
By default, one LWP is assigned to each process; additional LWPs are created if all the process's LWPs are sleeping and there are additional user threads that libthread can schedule. The programmer can specify that threads are bound to LWPs.
Lightweight process information for a process can be examined with ps -elcL.
User Threads
User threads are scheduled on their LWPs via a scheduler in libthread. This scheduler does implement priorities, but does not implement time slicing. If time slicing is desired, it must be programmed in.
Locking issues must also be carefully considered by the programmer in order to prevent several threads from blocking on a single resource.
User threads are also responsible for handling of SIGSEGV (segmentation violation) signals, since the kernel does not keep track of user thread stacks.
Each thread has the following characteristics:
• Has its own stack.
• Shares the process address space.
• Executes independently (and perhaps concurrently with other threads).
• Completely invisible from outside the process.
• Cannot be controlled from the command line.
• No system protection between threads in a process; the programmer is responsible for interactions.
• Can share information between threads without IPC overhead.
Priorities
Higher numbered priorities are given precedence. The scheduling page contains additional information on how priorities are set.
Zombie Processes
When a process dies, it becomes a zombie process. Normally, the parent performs a wait() and cleans up the PID. Sometimes, the parent receives too many SIGCHLD signals at once, but can only handle one at a time. It is possible to resend the signal on behalf of the child via kill -18 PPID. Killing the parent or rebooting will also clean up zombies. The correct answer is to fix the buggy parent code that failed to perform the wait() properly.
Aside from their inherent sloppiness, the only problem with zombies is that they take up a place in the process table.
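A minimal cleanup sketch (the PIDs are placeholders):

# ps -ef | grep defunct      (zombies show up as <defunct>)
# ptree 4242      (identify the parent of zombie PID 4242)
# kill -18 3999      (send SIGCHLD to the parent, PID 3999, so that it can reap the child)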
Kernel Tunables
The following kernel tunables are important when looking at processes:
• maxusers: By default, this is set to 2 less than the number of MB of physical memory, up to 1024. It can be set up to 2048 manually in the /etc/system file.
• max_nprocs: Maximum number of processes that can be active simultaneously on the system. The default for this is (16 x maxusers) + 10. The minimum setting for this is 138, the maximum is 30,000.
• maxuprc: The number of processes a single non-root user can create. The default setting for this is max_nprocs - 5; the minimum is 133.
• ndquot: This is the number of disk quota structures. The default for this is (maxusers x 10) + max_nprocs. The minimum is 213.
• pt_cnt: Sets the number of System V ptys.
• npty: Sets the number of BSD ptys. (Should be set to pt_cnt.)
• sad_cnt: Sets the number of STREAMS addressable devices. (Should be set to 2 x pt_cnt.)
• nautopush: Sets the number of STREAMS autopush entries. (Should be set to pt_cnt.)
• ncsize: Sets DNLC size.
• ufs_ninode: Sets inode cache size.
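These are set in /etc/system and take effect at the next boot. The values below are purely illustrative (following the pt_cnt relationships noted above), not recommendations:

set maxusers=512
set max_nprocs=10000
set pt_cnt=256
set npty=256
set sad_cnt=512
set nautopush=256
set ncsize=8192
set ufs_ninode=8192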
proc Commands
The proc tools are useful for tracing attributes of processes. These utilities include:
• pflags: Prints the tracing flags, pending and held signals and other /proc status information for each LWP.
• pcred: Prints credentials (ie, EUID/EGID, RUID/RGID, saved UID/GIDs).
• pmap: Prints process address space map.
• pldd: Lists dynamic libraries linked to the process.
• psig: Lists signal actions.
• pstack: Prints a stack trace for each LWP in the process.
• pfiles: Reports fstat, fcntl information for all open files.
• pwdx: Prints each process's working directory.
• pstop: Stops process.
• prun: Starts stopped process.
• pwait: Wait for specified processes to terminate.
• ptree: Prints process tree for process.
• ptime: Times the command using microstate accounting; does not time children.
These commands can be run against a specific process, but most of them can also be run against all processes on the system; see the proc(1) man page for details.
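For example, to inspect a single process (4242 is a placeholder PID):

# pcred 4242      (real/effective/saved IDs)
# pfiles 4242      (open file descriptors with fstat/fcntl details)
# pstack 4242      (stack trace of every LWP)
# ptree 4242      (ancestry and children)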


Process Virtual Memory
Each process maps either 2^32 or 2^44 bytes of memory (depending on whether the OS is running in 32 or 64-bit mode). This works out to 4GB or 16TB. Not all of this memory is allocated (used); the virtual memory is used as address space that can be mapped to actual memory resources.
Virtual memory for a process is structured as follows, from the top of the address space down:
• Kernel: While this is part of the address space, it cannot be addressed directly by the process. It must be accessed via system calls.
• Stack: Used by the program for variables and storage. It grows and shrinks in size depending on what routines are called and what their stack space requirements are. It is normally about 8 MB in size.
• Shared Libraries: Shared libraries are position independent so that they can be shared by all programs that want to use them. One common example is libc.so.
• hole: This is the address space that is unallocated and unused. It does not tie up physical memory. For most processes, this is the largest portion of the virtual memory for the process.
• heap: Used for some types of working storage. It is allocated by the malloc function.
• BSS: Uninitialized variables. These are not part of the executable file and their initial value is set to zeros.
• Data: Global variables, constants and static variables from the program.
• Text: The set of instructions from the compiler-generated executable file.
The virtual memory map for a process can be displayed using the pmap command.
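For example, run against the current shell:

# pmap $$      (text, heap, stack and shared library mappings)
# pmap -x $$      (extended output with resident and shared sizes, where supported)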
Additional information is available on the Processes page.
CDE Configuration
Please note that the below assumes that the user is using csh, which is far and away the most common shell in our environment.
If CDE fails, the user should attempt to log in under command line mode. That way, the user can make the appropriate edits to his/her environment to correct whatever configuration issues are present. (If the user is unable to log in under command line mode, something more serious is wrong with his/her account.) Command line mode is accessible under the "options" button of the dtlogin screen.
The quickest way to get someone to a useable CDE configuration is via the following steps in the user's home directory:
• mv .dtprofile .dtprofile.old
• mv .dt .dt.old
• /usr/princeton/bin/updatedots
The above steps will replace the user's environment with the default environment upon the next dtlogin attempt. This means that any customizations that have been added to the user's environment will be lost. This option is the best one for new or inexperienced users.
CDE failures are most often a problem with the LD_LIBRARY_PATH. There are some incompatibilities between the X libraries in /usr/princeton/lib and those in /usr/openwin/lib and /usr/dt/lib. The system-specific libraries should come first in the LD_LIBRARY_PATH. For example:
setenv LD_LIBRARY_PATH /usr/openwin/lib:/usr/dt/lib:${LD_LIBRARY_PATH}


If this is the problem, the above line (or a logical equivalent) should be added in the user's .login file.
Another problem exists in older versions of the Princeton default .login file. If dtlogin attempts to source this file, and one of the msgs commands is enabled without any checking, the dtlogin session will freeze or crash. The msgs commands can be bracketed with an "if" statement, as is done in the current default .login file:
if ( ! ${?DT} ) then
    msgs -p
endif
Alternatively, the user's .dtprofile can be instructed to ignore the .login file by commenting out the following line:
DTSOURCEPROFILE=true
