Matt Blaze
Department of Computer Science, Princeton University
mab@cs.princeton.edu
ABSTRACT
Traces of filesystem activity have proven
to be useful for a wide variety of
purposes, ranging from quantitative analysis of system behavior to trace-driven simulation of filesystem algorithms. Such traces can be
difficult to obtain, however, usually
entailing modification of the
filesystems to be monitored and runtime
overhead for the period of the
trace. Largely because of these
difficulties, a surprisingly small number of
filesystem traces have been conducted, and
few sample workloads are
available to filesystem researchers.
This paper describes a portable toolkit for
deriving approximate traces of
NFS [1] activity by non-intrusively
monitoring the Ethernet traffic to and
from the file server. The toolkit uses a
promiscuous Ethernet listener
interface (such as the Packetfilter[2]) to
read and reconstruct NFS-related
RPC packets intended for the server. It
produces traces of the NFS activity
as well as a plausible set of corresponding
client system calls. The tool is
currently in use at Princeton and other
sites, and is available via
anonymous ftp.
1. Motivation
Traces of real workloads form an important
part of virtually all analysis of
computer system behavior, whether it is
program hot spots, memory access
patterns, or filesystem activity that is
being studied. In the case of
filesystem activity, obtaining useful
traces is particularly challenging.
Filesystem behavior can span long time
periods, often making it necessary to
collect huge traces over weeks or even
months. Modification of the
filesystem to collect trace data is often
difficult, and may result in
unacceptable runtime overhead. Distributed
filesystems exacerbate these
difficulties, especially when the network
is composed of a large number of
heterogeneous machines. As a result of
these difficulties, only a relatively
small number of traces of Unix filesystem
workloads have been conducted,
primarily in computing research
environments. [3], [4] and [5] are examples
of such traces.
Since distributed filesystems work by
transmitting their activity over a
network, it would seem reasonable to obtain
traces of such systems by
placing a "tap" on the network
and collecting trace data based on the
network traffic. Ethernet[6] based networks
lend themselves to this approach
particularly well, since traffic is
broadcast to all machines connected to a
given subnetwork. A number of
general-purpose network monitoring tools are
available that "promiscuously"
listen to the Ethernet to which they are
connected; Sun's etherfind[7] is an example
of such a tool. While these
tools are useful for observing (and
collecting statistics on) specific types
of packets, the information they provide is
at too low a level to be useful
for building filesystem traces. Filesystem
operations may span several
packets, and may be meaningful only in the
context of other, previous
operations.
Some work has been done on characterizing
the impact of NFS traffic on
network load. In [8], for example, the
results of a study are reported in
which Ethernet traffic was monitored and
statistics gathered on NFS
activity. While useful for understanding
traffic patterns and developing a
queueing model of NFS loads, these previous
studies do not use the network
traffic to analyze the file access traffic
patterns of the system, focusing
instead on developing a statistical model
of the individual packet sources,
destinations, and types.
This paper describes a toolkit for
collecting traces of NFS file access
activity by monitoring Ethernet traffic. A
"spy" machine with a promiscuous
Ethernet interface is connected to the same
network as the file server. Each
NFS-related packet is analyzed and a trace
is produced at an appropriate
level of detail. The tool can record the
low-level NFS calls themselves or
an approximation of the user-level system
calls (open, close, etc.) that
triggered the activity.
We partition the problem of deriving NFS
activity from raw network traffic
into two fairly distinct subproblems: that
of decoding the low-level NFS
operations from the packets on the network,
and that of translating these
low-level commands back into user-level
system calls. Hence, the toolkit
consists of two basic parts, an "RPC
decoder" (rpcspy) and the "NFS
analyzer" (nfstrace). rpcspy
communicates with a low-level network
monitoring facility (such as Sun's NIT [9]
or the Packetfilter [2]) to read
and reconstruct the RPC transactions (call
and reply) that make up each NFS
command. nfstrace takes the output of
rpcspy and reconstructs the system
calls that occurred as well as other
interesting data it can derive about
the structure of the filesystem, such as
the mappings between NFS file
handles and Unix file names. Since there is
not a clean one-to-one mapping
between system calls and lower-level NFS
commands, nfstrace uses some simple
heuristics to guess a reasonable
approximation of what really occurred.
1.1. A Spy's View of the NFS Protocols
It is well beyond the scope of this paper
to describe the protocols used by
NFS; for a detailed description of how NFS
works, the reader is referred to
[10], [11], and [12]. What follows is a
very brief overview of how NFS
activity translates into Ethernet packets.
An NFS network consists of servers, to
which filesystems are physically
connected, and clients, which perform
operations on remote server
filesystems as if the disks were locally
connected. A particular machine can
be a client or a server or both. Clients
mount remote server filesystems in
their local hierarchy just as they do local
filesystems; from the user's
perspective, files on NFS and local
filesystems are (for the most part)
indistinguishable, and can be manipulated
with the usual filesystem calls.
The interface between client and server is
defined in terms of 17 remote
procedure call (RPC) operations. Remote
files (and directories) are referred
to by a file handle that uniquely
identifies the file to the server. There
are operations to read and write bytes of a
file (read, write), obtain a
file's attributes (getattr), obtain the
contents of directories (lookup,
readdir), create files (create), and so
forth. While most of these
operations are direct analogs of Unix
system calls, notably absent are open
and close operations; no client state
information is maintained at the
server, so there is no need to inform the
server explicitly when a file is
in use. Clients can maintain buffer cache
entries for NFS files, but must
verify that the blocks are still valid (by
checking the last write time with
the getattr operation) before using the
cached data.
An RPC transaction consists of a call
message (with arguments) from the
client to the server and a reply message
(with return data) from the server
to the client. NFS RPC calls are
transmitted using the UDP/IP connectionless, unreliable datagram protocol[13]. The
call message contains a unique
transaction identifier which is included in
the reply message to enable the
client to match the reply with its call.
The data in both messages is
encoded in an "external data
representation" (XDR), which provides a
machine-independent standard for byte
order, etc.
Note that the NFS server maintains no state
information about its clients,
and knows nothing about the context of each
operation outside of the
arguments to the operation itself.
2. The rpcspy Program
rpcspy is the interface to the
system-dependent Ethernet monitoring
facility; it produces a trace of the RPC
calls issued between a given set of
clients and servers. At present, there are
versions of rpcspy for a number
of BSD-derived systems, including ULTRIX
(with the Packetfilter[2]), SunOS
(with NIT[9]), and the IBM RT running AOS
(with the Stanford enet filter).
For each RPC transaction monitored, rpcspy
produces an ASCII record
containing a timestamp, the name of the
server, the client, the length of
time the command took to execute, the name
of the RPC command executed, and
the command-specific arguments and return
data. Currently, rpcspy
understands and can decode the 17 NFS RPC
commands, and there are hooks to
allow other RPC services (for example, NIS)
to be added reasonably easily.
The output may be read directly or piped into another program for further analysis; the format is designed to be reasonably friendly to both human readers and other programs (such as nfstrace or awk).
Since each RPC transaction consists of two
messages, a call and a reply,
rpcspy waits until it receives both these
components and emits a single
record for the entire transaction. The basic output format consists of seven vertical-bar-separated fields:
timestamp | execution-time | server | client | command-name | arguments | reply-data
where timestamp is the time the reply
message was received, execution-time
is the time (in microseconds) that elapsed
between the call and reply,
server is the name (or IP address) of the
server, client is the name (or IP
address) of the client followed by the
userid that issued the command,
command-name is the name of the particular
program invoked (read, write,
getattr, etc.), and arguments and
reply-data are the command dependent
arguments and return values passed to and
from the RPC program,
respectively.
The exact format of the argument and reply
data is dependent on the specific
command issued and the level of detail the
user wants logged. For example, a
typical NFS command is recorded as follows:
690529992.167140 | 11717 | paramount | merckx.321 | read | {"7b1f00000000083c", 0, 8192} | ok, 1871
In this example, uid 321 at client
"merckx" issued an NFS read command to
server "paramount". The reply was
issued at (Unix time) 690529992.167140
seconds; the call command occurred 11717
microseconds earlier. Three
arguments are logged for the read call: the
file handle from which to read
(represented as a hexadecimal string), the
offset from the beginning of the
file, and the number of bytes to read. In
this example, 8192 bytes are
requested starting at the beginning (byte
0) of the file whose handle is
"7b1f00000000083c". The command
completed successfully (status "ok"), and
1871 bytes were returned. Of course, the
reply message also included the
1871 bytes of data from the file, but that
field of the reply is not logged
by rpcspy.
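
To make the record layout concrete, the following Python sketch (illustrative only, not part of the toolkit) splits such a record into named fields; the function and dictionary keys are our own.

# Illustrative parser for one rpcspy record (function and field names are ours).
def parse_rpcspy_record(line):
    fields = [f.strip() for f in line.split("|")]
    timestamp, exec_time, server, client, command, arguments, reply = fields
    client_host, _, uid = client.partition(".")      # e.g. "merckx.321"
    return {
        "timestamp": float(timestamp),               # reply arrival time
        "exec_usec": int(exec_time),                 # call-to-reply latency
        "server": server,
        "client": client_host,
        "uid": uid,
        "command": command,                          # read, write, getattr, ...
        "arguments": arguments,
        "reply": reply,
    }

record = parse_rpcspy_record(
    '690529992.167140 | 11717 | paramount | merckx.321 | read | '
    '{"7b1f00000000083c", 0, 8192} | ok, 1871')
print(record["command"], record["exec_usec"])        # prints: read 11717
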
rpcspy has a number of configuration
options to control which hosts and RPC
commands are traced, which call and reply
fields are printed, which Ethernet
interfaces are tapped, how long to wait for
reply messages, how long to run,
etc. While its primary function is to provide input for the nfstrace program (see Section 3), judicious use of these options (along with such programs as grep and awk) permits its use as a simple NFS diagnostic and performance monitoring tool. A few screens of output give a surprisingly informative snapshot of current NFS activity; we have used the program to quickly identify several problems that were otherwise difficult to pinpoint. Similarly, a short awk script can
provide a breakdown of the most
active clients, servers, and hosts over a
sampled time period.
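
A Python stand-in for such a script might look like the following sketch (the log file name is illustrative); it tallies the busiest clients, servers, and commands in a sampled rpcspy log.

# Tally the busiest clients, servers, and commands in an rpcspy log
# (a hypothetical Python stand-in for the short awk script mentioned above).
from collections import Counter
import sys

clients, servers, commands = Counter(), Counter(), Counter()

with open(sys.argv[1] if len(sys.argv) > 1 else "rpcspy.log") as log:  # name is illustrative
    for line in log:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue                                 # skip malformed or truncated lines
        servers[fields[2]] += 1
        clients[fields[3].split(".")[0]] += 1        # drop the trailing ".uid"
        commands[fields[4]] += 1

for label, counter in (("clients", clients), ("servers", servers), ("commands", commands)):
    print(label + ":")
    for name, count in counter.most_common(5):
        print("  %-20s %d" % (name, count))
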
2.1. Implementation Issues
The basic function of rpcspy is to monitor
the network, extract those
packets containing NFS data, and print the
data in a useful format. Since
each RPC transaction consists of a call and
a reply, rpcspy maintains a
table of pending call packets that are
removed and emitted when the matching
reply arrives. In normal operation on a
reasonably fast workstation, this
rarely requires more than about two
megabytes of memory, even on a busy network with unusually slow file servers.
Should a server go down, however, the
queue of pending call messages (which are
never matched with a reply) can
quickly become a memory hog; the user can
specify a maximum size the table
is allowed to reach before these
"orphaned" calls are searched out and
reclaimed.
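
The following Python sketch illustrates this general bookkeeping; the table limit, timeout, and function names are illustrative assumptions, not rpcspy's actual internals.

# Sketch of call/reply matching with reclamation of "orphaned" calls.
import time

MAX_PENDING = 10000        # illustrative limit; rpcspy's is user-configurable
ORPHAN_AGE = 60.0          # assumed seconds before an unmatched call is reclaimed

pending = {}               # (client_addr, xid) -> (arrival_time, call_packet)

def saw_call(client_addr, xid, call_packet):
    pending[(client_addr, xid)] = (time.time(), call_packet)
    if len(pending) > MAX_PENDING:
        reclaim_orphans()

def saw_reply(client_addr, xid, reply_packet):
    entry = pending.pop((client_addr, xid), None)
    if entry is None:
        return None                      # reply with no recorded call; drop it
    call_time, call_packet = entry
    return call_time, call_packet, reply_packet   # emit one complete record

def reclaim_orphans():
    """Drop calls that never received a reply (e.g. the server went down)."""
    cutoff = time.time() - ORPHAN_AGE
    for key in [k for k, (t, _) in pending.items() if t < cutoff]:
        del pending[key]
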
File handles pose special problems. While
all NFS file handles are a fixed
size, the number of significant bits varies
from implementation to
implementation; even within a vendor, two
different releases of the same
operating system might use a completely
different internal handle format. In
most Unix implementations, the handle
contains a filesystem identifier and
the inode number of the file; this is
sometimes augmented by additional
information, such as a version number.
Since programs using rpcspy output
generally will use the handle as a unique
file identifier, it is important
that there not appear to be more than one
handle for the same file.
Unfortunately, it is not sufficient to
simply consider the handle as a
bitstring of the maximum handle size, since
many operating systems do not
zero out the unused extra bits before
assigning the handle. Fortunately,
most servers are at least consistent in the
sizes of the handles they
assign. rpcspy allows the user to specify
(on the command line or in a
startup file) the handle size for each host
to be monitored. The handles
from that server are emitted as hexadecimal
strings truncated at that
length. If no size is specified, a guess is
made based on a few common
formats of a reasonable size.
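
A sketch of this normalization, with invented per-server sizes, might look like the following; rpcspy's actual guessing heuristic is more involved.

# Illustrative handle normalization; the per-server sizes here are invented.
HANDLE_BYTES = {"paramount": 8, "basso": 8}      # hypothetical per-server sizes
DEFAULT_BYTES = 16                               # fallback guess

def normalize_handle(server, raw_handle):
    """Truncate a raw file handle to the server's significant length and
    render it as hex, so each file maps to exactly one identifier."""
    size = HANDLE_BYTES.get(server, DEFAULT_BYTES)
    return raw_handle[:size].hex()

raw = bytes.fromhex("7b1f00000000083c") + b"\xff" * 24   # unused bits not zeroed
print(normalize_handle("paramount", raw))                # prints: 7b1f00000000083c
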
It is usually desirable to emit IP
addresses of clients and servers as their
symbolic host names. An early version of
the software simply did a
nameserver lookup each time this was
necessary; this quickly flooded the
network with a nameserver request for each
NFS transaction. The current
version maintains a cache of host names;
this requires only a modest
amount of memory for typical networks of
less than a few hundred hosts. For
very large networks or those where NFS
service is provided to a large number
of remote hosts, this could still be a
potential problem, but as a last
resort remote name resolution could be
disabled or rpcspy configured to not
translate IP addresses.
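
A minimal version of such a cache, using the standard resolver, might look like the following sketch; rpcspy's implementation may differ.

# Sketch of a host-name cache: resolve each address once and remember it.
import socket

_name_cache = {}

def host_name(ip_addr, translate=True):
    """Avoid issuing one nameserver query per NFS transaction."""
    if not translate:
        return ip_addr                       # name translation disabled
    if ip_addr not in _name_cache:
        try:
            _name_cache[ip_addr] = socket.gethostbyaddr(ip_addr)[0]
        except OSError:
            _name_cache[ip_addr] = ip_addr   # fall back to the dotted quad
    return _name_cache[ip_addr]
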
UDP/IP datagrams may be fragmented among several
packets if the datagram is
larger than the maximum size of a single
Ethernet frame. rpcspy looks only
at the first fragment; in practice,
fragmentation occurs only for the data
fields of NFS read and write transactions,
which are ignored anyway.
3. nfstrace: The Filesystem Tracing Package
Although rpcspy provides a trace of the
low-level NFS commands, it is not,
in and of itself, sufficient for obtaining
useful filesystem traces. The
low-level commands do not by themselves
reveal user-level activity. Furthermore, the volume of data that would need
to be recorded is potentially
enormous, on the order of megabytes per
hour. More useful would be an
abstraction of the user-level system calls
underlying the NFS activity.
nfstrace is a filter for rpcspy that
produces a log of a plausible set of
user-level filesystem commands that could
have triggered the monitored
activity. A record is produced each time a
file is opened, giving a summary
of what occurred. This summary is detailed
enough for analysis or for use as
input to a filesystem simulator.
The output format of nfstrace consists of seven vertical-bar-separated fields:
timestamp | command-time | direction | file-id | client | transferred | size
where timestamp is the time the open
occurred, command-time is the length of
time between open and close, direction is either read or write (mkdir and readdir count as write and read, respectively), file-id identifies the
server and the file handle, client is the
client and user that performed the
open, transferred is the number of bytes of
the file actually read or
written (cache hits have a 0 in this
field), and size is the size of the
file (in bytes).
An example record might be as follows:
690691919.593442 | 17734 | read | basso:7b1f00000000400f | frejus.321 | 0 | 24576
Here, userid 321 at client frejus read file
7b1f00000000400f on server
basso. The file is 24576 bytes long and was read entirely from the client cache. The command started at Unix time
690691919.593442 and took 17734
microseconds at the server to execute.
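
As with the rpcspy format, a post-processor might split such a record as in the following sketch (the parser and its field names are ours, not part of nfstrace).

# Illustrative parser for one seven-field nfstrace record.
def parse_nfstrace_record(line):
    (timestamp, cmd_time, direction,
     file_id, client, transferred, size) = (f.strip() for f in line.split("|"))
    server, handle = file_id.split(":")
    client_host, _, uid = client.partition(".")
    return {
        "timestamp": float(timestamp),   # time of the inferred open
        "usec": int(cmd_time),           # command time in microseconds
        "direction": direction,          # "read" or "write"
        "server": server,
        "handle": handle,
        "client": client_host,
        "uid": uid,
        "bytes": int(transferred),       # 0 for an (estimated) cache hit
        "size": int(size),
    }

rec = parse_nfstrace_record(
    "690691919.593442 | 17734 | read | basso:7b1f00000000400f | frejus.321 | 0 | 24576")
print(rec["direction"], rec["bytes"], rec["size"])   # prints: read 0 24576
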
Since it is sometimes useful to know the
name corresponding to the handle
and the mode information for each file,
nfstrace optionally produces a map
of file handles to file names and modes.
When enough information (from
lookup and readdir commands) is received,
new names are added. Names can
change over time (as files are deleted and
renamed), so the times each
mapping can be considered valid are recorded as well. The mapping information may not always be complete, however,
depending on how much activity
has already been observed. Also, hard links
can confuse the name mapping,
and it is not always possible to determine
which of several possible names a
file was opened under.
What nfstrace produces is only an
approximation of the underlying user
activity. Since there are no NFS open or
close commands, the program must
guess when these system calls occur. It
does this by taking advantage of the
observation that NFS is fairly consistent
in what it does when a file is
opened. If the file is in the local buffer
cache, a getattr call is made on
the file to verify that it has not changed
since the file was cached.
Otherwise, the actual bytes of the file are
fetched as they are read by the
user. (It is possible that part of the file
is in the cache and part is not,
in which case the getattr is performed and
only the missing pieces are
fetched. This occurs most often when a demand-paged executable is loaded.)
nfstrace assumes that any sequence of NFS
read calls on the same file issued
by the same user at the same client is part
of a single open for read. The
close is assumed to have taken place when
the last read in the sequence
completes. The end of a read sequence is
detected when the same client reads
the beginning of the file again or when a
timeout with no reading has
elapsed. Writes are handled in a similar
manner.
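
The read heuristic can be sketched roughly as follows; the timeout value and emit callback are illustrative assumptions, and nfstrace's actual implementation differs in detail. Writes would be handled analogously.

# Rough sketch of open/close inference for reads.
READ_TIMEOUT = 600.0   # assumed seconds of idle time that end an inferred "open"

open_reads = {}        # (client, uid, handle) -> [first_ts, last_ts, bytes_read]

def nfs_read(timestamp, client, uid, handle, offset, nbytes, emit):
    """Fold one NFS read into a per-(client, uid, file) open-for-read record."""
    key = (client, uid, handle)
    current = open_reads.get(key)
    # A long pause, or re-reading the start of the file, ends the old sequence.
    if current and (timestamp - current[1] > READ_TIMEOUT or offset == 0):
        emit(key, current[0], current[1], current[2])   # inferred open..close
        current = None
    if current is None:
        open_reads[key] = [timestamp, timestamp, nbytes]
    else:
        current[1] = timestamp
        current[2] += nbytes
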
Reads that are entirely from the client
cache are a bit harder; not every
getattr command is caused by a cache read,
and a few cache reads take place
without a getattr. A user-level stat system
call can sometimes trigger a
getattr, as can an ls -l command.
Fortunately, the attribute caching used by
most implementations of NFS seems to
eliminate many of these extraneous
getattrs, and ls commands appear to trigger
a lookup command most of the
time. nfstrace assumes that a getattr on
any file that the client has read
within the past few hours represents a
cache read; otherwise, the getattr is ignored.
This simple heuristic seems to be fairly
accurate in practice. Note also
that a getattr might not be performed if a
read occurs very soon after the
last read, but the time threshold is
generally short enough that this is
rarely a problem. Still, the cached reads
that nfstrace reports are, at
best, an estimate (generally erring on the
side of over-reporting). There is
no way to determine the number of bytes
actually read for cache hits.
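
In outline, the cache-read test looks something like the following sketch; the few-hour window is an assumed value here.

# Sketch of the heuristic: a getattr counts as a cached read only if the
# same client has read the file recently.
RECENT_READ_WINDOW = 3 * 3600.0    # "past few hours", assumed value

last_read_at = {}                  # (client, handle) -> timestamp of last read

def note_read(timestamp, client, handle):
    last_read_at[(client, handle)] = timestamp

def getattr_is_cache_read(timestamp, client, handle):
    then = last_read_at.get((client, handle))
    return then is not None and timestamp - then < RECENT_READ_WINDOW
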
The output of nfstrace is necessarily
produced out of chronological order,
but may be sorted easily by a
post-processor.
nfstrace has a host of options to control
the level of detail of the trace,
the lengths of the timeouts, and so on. To
facilitate the production of very
long traces, the output can be flushed and
checkpointed at a specified interval, and can be automatically compressed.
4. Using rpcspy and nfstrace for Filesystem
Tracing
Clearly, nfstrace is not suitable for
producing highly accurate traces;
cache hits are only estimated, the timing
information is imprecise, and data
from lost (and duplicated) network packets
are not accounted for. When such
a highly accurate trace is required, other
approaches, such as modification
of the client and server kernels, must be
employed.
The main virtue of the passive-monitoring
approach lies in its simplicity.
In [5], Baker et al. describe a trace of a
distributed filesystem which
involved low-level modification of several
different operating system
kernels. In contrast, our entire filesystem
trace package consists of less
than 5000 lines of code written by a single
programmer in a few weeks,
involves no kernel modifications, and can
be installed to monitor multiple
heterogeneous servers and clients without even knowing what operating systems they are running.
The most important parameter affecting the
accuracy of the traces is the
ability of the machine on which rpcspy is
running to keep up with the
network traffic. Although most modern RISC
workstations with reasonable
Ethernet interfaces are able to keep up
with typical network loads, it is
important to determine how much information was lost due to packet buffer
overruns before relying upon the trace
data. It is also important that the
trace be, indeed, non-intrusive. It quickly
became obvious, for example,
that logging the traffic to an NFS
filesystem can be problematic.
Another parameter affecting the usefulness
of the traces is the validity of
the heuristics used to translate from RPC
calls into user-level system
calls. To test this, a shell script was
written that performed ls -l, touch,
cp and wc commands randomly in a small
directory hierarchy, keeping a record
of which files were touched and read and at
what time. After several hours,
nfstrace was able to detect 100% of the
writes, 100% of the uncached reads,
and 99.4% of the cached reads. Cached reads
were over-reported by 11%, even
though ls commands (which cause the
"phantom" reads) made up 50% of the
test activity. While this test provides
encouraging evidence of the accuracy
of the traces, it is not by itself
conclusive, since the particular workload
being monitored may fool nfstrace in
unanticipated ways.
As in any research where data are collected
about the behavior of human
subjects, the privacy of the individuals
observed is a concern. Although
the contents of files are not logged by the
toolkit, it is still possible to
learn something about individual users from
examining what files they read
and write. At a minimum, the users of a monitored system should be informed
of the nature of the trace and the uses to
which it will be put. In some
cases, it may be necessary to disable the
name translation from nfstrace
when the data are being provided to others.
Commercial sites where filenames
might reveal something about proprietary
projects can be particularly
sensitive to such concerns.
5. A Trace of Filesystem Activity in the
Princeton C.S. Department
A previous paper[14] analyzed a five-day
long trace of filesystem activity
conducted on 112 research workstations at
DEC-SRC. The paper identified a
number of file access properties that
affect filesystem caching performance; it is difficult, however, to know
whether these properties were
unique artifacts of that particular
environment or are more generally
applicable. To help answer that question, it
is necessary to look at similar
traces from other computing environments.
It was relatively easy to use rpcspy and
nfstrace to conduct a week-long trace of filesystem activity in the
Princeton University Computer Science
Department. The departmental computing
facility serves a community of
approximately 250 users, of which about 65%
are researchers (faculty,
graduate students, undergraduate
researchers, postdoctoral staff, etc.), 5%
office staff, 2% systems staff, and the
rest guests and other "external"
users. About 115 of the users work
full-time in the building and use the
system heavily for electronic mail,
netnews, and other such communication
services as well as other computer science
research oriented tasks (editing,
compiling, and executing programs,
formatting documents, etc.).
The computing facility consists of a
central Auspex file server (fs) (to
which users do not ordinarily log in
directly), four DEC 5000/200s (elan,
hart, atomic and dynamic) used as shared
cycle servers, and an assortment of
dedicated workstations (NeXT machines, Sun
workstations, IBM-RTs, Iris
workstations, etc.) in individual offices
and laboratories. Most users log
in to one of the four cycle servers via X
window terminals located in
offices; the terminals are divided evenly
among the four servers. There are
a number of Ethernets throughout the
building. The central file server is
connected to a "machine room
network" to which no user terminals are
directly connected; traffic to the file
server from outside the machine room
is gatewayed via a Cisco router. Each of
the four cycle servers has a local
/, /bin and /tmp filesystem; other
filesystems, including /usr, /usr/local,
and users' home directories are NFS mounted
from fs. Mail sent from local
machines is delivered locally to the
(shared) fs:/usr/spool/mail; mail from
outside is delivered directly on fs.
The trace was conducted by connecting a
dedicated DEC 5000/200 with a local
disk to the machine room network. This
network carries NFS traffic for all
home directory access and access to all
non-local cycle-server files
(including most of the actively used programs). On a typical weekday,
about 8 million packets are transmitted
over this network. nfstrace was
configured to record opens for read and
write (but not directory accesses or
individual reads or writes). After one week (Wednesday to Wednesday),
342,530 opens for read and 125,542 opens
for write were recorded, occupying
8 MB of (compressed) disk space. Most of
this traffic was from the four
cycle servers.
No attempt was made to
"normalize" the workload during the trace period.
Although users were notified that file
accesses were being recorded, and
provided an opportunity to ask to be
excluded from the data collection, most
users seemed to simply continue with their
normal work. Similarly, no
correction is made for any anomalous user
activity that may have occurred
during the trace.
5.1. The Workload Over Time
Intuitively, the volume of traffic can be
expected to vary with the time of
day. Figure 1 shows the number of reads and
writes per hour over the seven
days of the trace; in particular, the
volume of write traffic seems to
mirror the general level of departmental
activity fairly closely.
An important metric of NFS performance is
the client buffer cache hit rate.
Each of the four cycle servers allocates
approximately 6MB of memory for the
buffer cache. The (estimated) aggregate hit
rate (percentage of reads served
by client caches) as seen at the file
server was surprisingly low: 22.2%
over the entire week. In any given hour,
the hit rate never exceeded 40%.
Figure 2 plots (actual) server reads and
(estimated) cache hits per hour
over the trace week; observe that the hit
rate is at its worst during
periods of the heaviest read activity.
Past studies have predicted much higher hit
rates than the aggregate
observed here. It is probable that since
most of the traffic is generated by
the shared cycle servers, the low hit rate
can be attributed to the large
number of users competing for cache space.
In fact, the hit rate was
observed to be much higher on the
single-user workstations monitored in the
study, averaging above 52% overall. This
suggests, somewhat
counter-intuitively, that if more computers
were added to the network (such
that each user had a private workstation),
the server load would decrease
considerably. Figure 3 shows the actual
cache misses and estimated cache
hits for a typical private workstation in
the study.
[Figure 1 - Read and Write Traffic Over Time: reads and writes per hour over the trace week (Thu 00:00 through Wed 18:00), with separate curves for writes and all reads.]
5.2. File Sharing
One property observed in the DEC-SRC trace
is the tendency of files that are
used by multiple workstations to make up a
significant proportion of read
traffic but a very small proportion of
write traffic. This has important
implications for a caching strategy, since,
when it is true, files that are
cached at many places very rarely need to
be invalidated. Although the
Princeton computing facility does not have
a single workstation per user, a
similar metric is the degree to which files
read by more than one user are
read and written. In this respect, the
Princeton trace is very similar to
the DEC-SRC trace. Files read by more than
one user make up more than 60% of
read traffic, but less than 2% of write
traffic. Files shared by more than
ten users make up less than 0.2% of write traffic but still more than 30% of read traffic. Figure 4 plots the number of users who have previously read
each file against the number of reads and
writes.
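
The sharing figures above can be recomputed with a simple pass over the trace; the following Python sketch (using a record parser like the one sketched in Section 3) shows one way to do it.

# Sketch: fraction of read and write traffic going to files that have been
# read by more than `threshold` distinct users at the time of the access.
def sharing_breakdown(records, threshold):
    readers = {}                      # (server, handle) -> set of uids seen reading it
    shared = {"read": 0, "write": 0}
    total = {"read": 0, "write": 0}
    for r in sorted(records, key=lambda r: r["timestamp"]):
        key = (r["server"], r["handle"])
        total[r["direction"]] += 1
        if len(readers.get(key, ())) > threshold:
            shared[r["direction"]] += 1
        if r["direction"] == "read":
            readers.setdefault(key, set()).add(r["uid"])
    return {d: shared[d] / max(total[d], 1) for d in total}
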
5.3. File "Entropy"
Files in the DEC-SRC trace demonstrated a
strong tendency to "become"
read-only as they were read more and more
often. That is, the probability
that the next operation on a given file
will overwrite the file drops off
sharply in proportion to the number of
times it has been read in the past.
Like the sharing property, this has
implications for a caching strategy,
since the probability that cached data is
valid influences the choice of a
validation scheme. Again, we find this
property to be very strong in the
Princeton trace. For any file access in the
trace, the probability that it
is a write is about 27%. If the file has
already been read at least once
since it was last written to, the write
probability drops to 10%. Once the
file has been read at least five times, the
write probability drops below
1%. Figure 5 plots the observed write probability against the number of reads since the last write.
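
This measurement amounts to tracking, for each file, how many reads have occurred since its last write; a sketch of the computation (again over parsed nfstrace records) follows.

# Sketch: observed write probability as a function of reads since last write.
from collections import defaultdict

def write_probability_by_read_count(records, max_n=20):
    reads_since_write = defaultdict(int)     # (server, handle) -> read count
    writes = defaultdict(int)                # n -> writes seen at read count n
    accesses = defaultdict(int)              # n -> accesses seen at read count n
    for r in sorted(records, key=lambda r: r["timestamp"]):
        key = (r["server"], r["handle"])
        n = min(reads_since_write[key], max_n)
        accesses[n] += 1
        if r["direction"] == "write":
            writes[n] += 1
            reads_since_write[key] = 0
        else:
            reads_since_write[key] += 1
    return {n: writes[n] / accesses[n] for n in sorted(accesses)}
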
[Figure 2 - Cache Hits and Misses Over Time: total reads per hour over the trace week, divided into estimated cache hits and actual cache misses.]
6. Conclusions
Although filesystem traces are a useful
tool for the analysis of current and
proposed systems, the difficulty of
collecting meaningful trace data makes
such traces difficult to obtain. The
performance degradation introduced by
the trace software and the volume of raw
data generated makes traces over
long time periods and outside of computing
research facilities particularly
hard to conduct.
Although not as accurate as direct,
kernel-based tracing, a passive network
monitor such as the one described in this
paper can permit tracing of
distributed systems relatively easily. The
ability to limit the data
collected to a high-level log of only the
data required can make it
practical to conduct traces over several
months. Such a long term trace is
presently being conducted at Princeton as
part of the author's research on
filesystem caching. The non-intrusive
nature of the data collection makes
traces possible at facilities where kernel
modification is impractical or
unacceptable.
It is the author's hope that other sites
(particularly those not doing
computing research) will make use of this
toolkit and will make the traces
available to filesystem researchers.
7. Availability
The toolkit, consisting of rpcspy,
nfstrace, and several support scripts,
currently runs under several BSD-derived
platforms, including ULTRIX 4.x,
SunOS 4.x, and IBM-RT/AOS. It is available
for anonymous ftp over the
Internet from samadams.princeton.edu, in
the compressed tar file
nfstrace/nfstrace.tar.Z.
[Figure 3 - Cache Hits and Misses Over Time - Private Workstation: reads per hour over the trace week for a typical private workstation, divided into estimated cache hits and actual cache misses.]
[Figure 4 - Degree of Sharing for Reads and Writes: percentage of reads and writes accounted for by files read by more than n users, for n from 0 to 20.]
[Figure 5 - Probability of Write Given >= n Previous Reads: observed probability that the next operation is a write, plotted against the number of reads since the last write (0 to 20).]
8. Acknowledgments
The author would like to gratefully
acknowledge Jim Roberts and Steve Beck
for their help in getting the trace machine
up and running, Rafael Alonso
for his helpful comments and direction, and
the members of the program committee for their valuable suggestions.
Jim Plank deserves special thanks
for writing jgraph, the software which
produced the figures in this paper.
9. References
[1] Sandberg, R., Goldberg, D., Kleiman,
S., Walsh, D., & Lyon, B. "Design
and Implementation of the Sun Network File System." Proc. USENIX, Summer 1985.
[2] Mogul, J., Rashid, R., & Accetta,
M. "The Packet Filter: An Efficient
Mechanism for User-Level Network
Code." Proc. 11th ACM Symp. on Operating
Systems Principles, 1987.
[3] Ousterhout, J., et al. "A
Trace-Driven Analysis of the Unix 4.2 BSD File
System." Proc. 10th ACM Symp. on
Operating Systems Principles, 1985.
[4] Floyd, R. "Short-Term File
Reference Patterns in a UNIX Environment,"
TR-177 Dept. Comp. Sci, U. of Rochester,
1986.
[5] Baker, M. et al. "Measurements of
a Distributed File System," Proc. 13th
ACM Symp. on Operating Systems Principles,
1991.
[6] Metcalfe, R. & Boggs, D.
"Ethernet: Distributed Packet Switching for
Local Computer Networks," CACM July,
1976.
[7] "Etherfind(8) Manual Page,"
SunOS Reference Manual, Sun Microsystems,
1988.
[8] Gusella, R. "Analysis of Diskless
Workstation Traffic on an Ethernet,"
TR-UCB/CSD-87/379, University of California, Berkeley, 1987.
[9] "NIT(4) Manual Page," SunOS
Reference Manual, Sun Microsystems, 1988.
[10] "XDR Protocol
Specification," Networking on the Sun Workstation, Sun
Microsystems, 1986.
[11] "RPC Protocol
Specification," Networking on the Sun Workstation, Sun
Microsystems, 1986.
[12] "NFS Protocol
Specification," Networking on the Sun Workstation, Sun
Microsystems, 1986.
[13] Postel, J. "User Datagram
Protocol," RFC 768, Network Information
Center, 1980.
[14] Blaze, M., and Alonso, R.,
"Long-Term Caching Strategies for Very Large
Distributed File Systems," Proc.
Summer 1991 USENIX, 1991.
Matt Blaze is a Ph.D. candidate in Computer
Science at Princeton University,
where he expects to receive his degree in
the Spring of 1992. His research
interests include distributed systems,
operating systems, databases, and
programming environments. His current
research focuses on caching in very
large distributed filesystems. In 1988 he
received an M.S. in Computer
Science from Columbia University and in
1986 a B.S. from Hunter College. He
can be reached via email at
mab@cs.princeton.edu or via US mail at Dept. of
Computer Science, Princeton University, 35
Olden Street, Princeton NJ
08544.