1.0 Introduction

An interactive television program we can play at home; virtual reality; a video conference on a computer. These are all examples of what is known today as multimedia: the use of sound, pictures and text to create a new medium of communication. Multimedia applications challenge today’s hardware and software because of their need to transfer large amounts of data at high speed. The information is usually stored on a machine other than the one displaying the data. Moreover, most multimedia applications need to be able to control the speed at which the data arrives in order to display it correctly. Satisfying the need for large amounts of data transferred at bursty interactively controllable rates stretches the limits of contemporary technology.

The speed at which data needs to be transferred from disk exceeds the speed of most disks that are used in contemporary computers. Often one must resort to specialized hardware to meet the need for this kind of speed. Since in a distributed environment the data is usually not stored on the computer that needs it, it has to be transferred across some sort of network. Again, the speed that is required for this action can be far in excess of what most networks can deliver. The disk technology that we have today allows for the transfer of only about 24Mbits of information per second. Full motion NTSC video requires bandwidth on the order of at least 45Mbits/sec [24], twice what a disk by itself can deliver to the application, even if the disk is dedicated to the application. One approach that is being used is dividing the data up across several disks on a workstation to allow parallel reading across the disks. This is known as disk striping, a variant of which is also known as RAID. However this alone does not solve the problem, as the International Radio Consultive Committee recommends that video objects have a bandwidth of 216 Mbits/sec [23], which requires more disks then a single workstation can accommodate without resorting to the use of special hardware. And then video based on HDTV (High Definition Television) would require on the order of 700 Mbits/sec. Another alternative is to compress the data on disk; there are several well-known compression algorithms in wide use for audio and video data storage. However, this second option also has some disadvantages, the quality of the data usually must be sacrificed in the process if any significant gain in compression is to be made. This kind of compression is commonly called "lossy" compression because of the loss of data. Some applications cannot work with data that has been altered in any way. For example, human gene data, compressed with such a lossy algorithm, is not exactly what it was before the compression and any representation of that data will have "artifacts" of the compression algorithm. Any conclusions drawn from such data are subject to distortion from these artifacts. In a project such as the mapping of every gene in the human body, such artifacts are unacceptable as input data for analysis. Therefore, one can see that certain applications’ data needs both exceed the bandwidth available through RAID and also require high-precision data that cannot be compressed. These applications require another means of retrieving data.

The alternative to a dedicated system overloaded with disks is to make a clear separation between the application and the data storage system. Such an alternative is pursued here, a data storage and retrieval system that is distributed over several workstations, each with several disks, all interconnected by a high speed network, that can respond to requests for the data in "real-time". Real-time here refers to the continuous retrieval of data at the bandwidth required by that data type. Here, the limit on disk bandwidth is overcome by using several hosts, each managing several disks, and using high speed nets to aggregate the data that must be delivered to the application. The data can be stored at geographically distant locations, without affecting (in fact, without the knowledge of) the application. The application simply requests the needed data.

Another multimedia application is a terrain navigation application that displays satellite or aerial data that has been previously stored in a remote image data base. The application can be used to simulate navigation through that terrain. Such an application demonstrates all the problems associated with multimedia. It needs massive amounts of terrain data (accumulated by satellites and high altitude fly over). It also needs real-time response, as the "pilot" can turn in any direction and the data for the new perspective has to be ready for the application to display. The data for such a display should not have artifacts from compression. It was the need to service this particular application that prompted the creation of the Image Server System.

The Image Server System (ISS) is a parallel distributed file server that supplies images quickly enough to support real-time viewing of image or video data. The ISS consists of several (4-5) typical Unix-based scientific workstations (i.e. Sun SparcStation Model 10-41), each with several (4-6) fast-SCSI disks on multiple (2-3) SCSI host adaptors. Each workstation is also equipped with an ATM (Asynchronous Transfer Mode) network interface card. By achieving a large amount of parallelism across approximately 20 disks, 10 host adaptors, and 5 network interfaces, throughput of 400 Mbits/sec (50MBytes/sec) has been achieved using relatively low-cost, "off the shelf" components. The ISS is designed so that the entire system is scalable, may consist of heterogenous hosts, and may be geographically distributed.

1.1 Sample Applications

The following sections explore some of the other types of multimedia applications that could benefit from the ISS.

Image Browser

There are satellite pictures of the entire surface of most of the closer planets, not to mention the enormous number of images of the entire surface of the earth. These amount to terabytes of image data that needs to be viewed in some coherent sense. An image browser is an application with which one can view this data without specialized knowledge of the UTM (Universal Time Measurement) coordinates of the particular place to be studied. One can view hundreds of megabytes of data at a very low resolution looking for interesting features that one might wish to explore, or "zoom" in by requesting a higher resolution of an interesting place.

The human gene mapping project is another database for which an image browser could be appropriate. The project creates hundreds of megabytes of information regarding the interrelationship of DNA molecules. An image browser would allow researchers to view this voluminous data looking for particular relationships.

Using the image browser seems very much like watching a TV to the user, with the exception that the user has some control over what is shown. An image browser supplied by an ISS would allow people to view the results of almost anything that can be graphically displayed, without the previous limitations associated with the amount of data to be viewed. Such limitations include the amount of memory on their workstation, disk or other secondary storage capacity, or whether the data must be fragmented to view it. The problem of storage is also simplified as the data only has to be in one location: each person interested in the data is not responsible for moving it to some local location before viewing it. Several users can also view the data simultaneously.

Terrain Visualization Tool

Since the ISS is an high speed distributed storage server capable of delivering large amounts of data at high speeds, it is not surprising that applications such as terrain visualization make ideal ISS clients. Terrain Visualization tools needs both a high bandwidth and a low latency response time, and the sequence of requests is unpredictable. The need for high bandwidth comes from flying over the land. Most flight simulators use artificial data, but a terrain visualization tool uses actual images of land. The faster the flight the more data is needed to represent the ground being flown over. The need for low latency in response time comes from the need for the application to display the data as the user requests it. Moreover, the application should not have to pre-load the data into memory to view it.

The initial target application for the ISS is for use in the MAGIC (Multidimensional Applications and Gigabit Internetwork Consortium) project, which was established in June 1992 by the U. S. government’s Advanced Research Projects Agency (ARPA)[22]. MAGIC’s charter is to develop a high-speed, wide-area networking test bed that will demonstrate interactive exchange of data at gigabit-per-second rates among multiple distributed servers and clients using a terrain visualization application known as TerraVision. TerraVision will allow a user to navigate through and over a landscape created from aerial images of the U.S. Army National Training Center in Fort Irwin, CA. TerraVision is of interest to the U.S. Army because the ability of a commander to share a common view of the battlefield with his or her command is critical to effective command and control. TerraVision requires large amounts of data, transferred at both bursty and steady rates, and is limited mainly by network throughput as its major limiting factor. The ISS is to be used to supply image data at near gigabit rates to TerraVision.

Sample Medical Application

An example of a medical application which will probably benefit from the use of an ISS is the collection and playback of angiography images. Procedures used to restore coronary blood flow, though clinically effective, are expensive and have contributed significantly to the rising cost of medical care. To minimize the cost of such procedures, medical care providers are beginning to concentrate these services in a few high volume tertiary care centers. Patients are usually referred to these centers by cardiologists at their home facilities; the centers then must communicate the results back to the local cardiologists as soon as possible after the procedure.

The advantages of providing specialized services at distant tertiary centers are significantly reduced if the medical information obtained during the procedure is not delivered rapidly and accurately to the treating physician in the patient's home facility. The delivery systems currently used to transfer patient information between facilities include interoffice mail, U.S. mail, fax machine, telephone, and courier. Often these systems are inadequate and can introduce delays in patient care.

With an ATM network and a high speed image file server, still image and video sequences can be collected from the medical imaging systems. These images are sent through an ATM network to storage and analysis systems, as well as directly to the clinic sites. Thus, several modes of use are possible: data will be collected and stored for later use; data will be delivered live from the imaging device to remote clinics in real-time; or these data flows will all take place simultaneously. An experiment using the architecture is illustrated in

Figure 1, "Medical Application Architecture," on page 8, is being implemented. Whether the ISS servers are local or distributed around the network is entirely a function of the optimal logistics. There are plans in regional healthcare information systems for centralized facilities of this type, see [12].

1.2 Video Server

A video server is much like a image browser except that it is specialized to deal with movies. One can think of a movie as a sequence of frames or pictures, commonly referred to as a "stream". This data is very similar to the current data that the image browser requires and that the ISS supplies. The main difference between an image browser and a video server is that a video server needs to be able to be at several different indices in the stream simultaneously, since several people may be watching this data (or movie) at the same time, but each may be at a different point in the movie. Some may just be starting the movie, some just finishing and others at some intermediate point.

This need for simultaneous parallel access to the data stream affects the layout of the data on disks in the ISS. Since there is a predictable sequential pattern to the requests for data, the data is placed on a disk in a like manner, that is, in a sequential pattern. "Pre-fetching" data and storing it in the ISS cache measurably improves performance and is trivial to implement as the data is in a linear sequence. "Pre-fetching" is the caching of data on the ISS side that the application knows in advance it will request, but which the application doesn’t actually yet need (and which it may not have space to store). In this scenario the application will ask the ISS to "pre-fetch" the information, that is, read the information off of disk and store it in the memory cache on the ISS for subsequent fast access by the application.

An excellent example of this technology is SGI’s/Time Warners’ forthcoming video server[14]. Time Warner has joined forces with a cable provider to allow access to movies on-demand. What this means is that people will be able to watch the movie they want whenever they want to on their home televisions. All the movies will be stored at some central place: however, "central" here should not be taken to mean "on one machine", but rather implies a group of machines, or cluster, that do not have to be physically close to each other to act as a unit. When someone wants to view a movie he/she can just ask the server to start playing it on his/her TV at home. Communication with the server will most likely take place through the so-called "cable box". This technology will allow people to watch movies at their convenience without leaving home. It will also allow several people to be watching the same movie at the same time, each able to stop, rewind, or fast-forward it.

1.3 Large Scale Image Browsers

A large scale image browser is just an imager browser as previously described in Section 1.1.1, but used to look at extremely large images (on the order of Gigabytes). It has all the limitations mentioned in section1.1.1: it must be constantly aware of the amount of memory available on the host machine that it is running on, should the image that it is currently viewing grow larger then the amount of memory available, the image will not be displayed, or in some cases that host machine will crash, or other unpredictable results.

2.0 Previous Work in this Field

In the following sections some of the work that has been done in moving large amounts of data stored across either several disks or several hosts is discussed.

2.1 RAID (theory)

RAID (Redundant Array of Inexpensive Disks)[21] is a system that was developed by U.C. Berkeley in an effort to attain faster disk access times. RAID operates on the principle that parallel reads are faster then sequential reads. Data is not only distributed, but also duplicated across multiple disks. Using RAID it is theoretically possible to observe an increase in the speed of retrieval of data off of a disk in direct proportion to the number of disks that are used. A number of disks in a RAID configuration is called a "disk array". There are six different levels of RAID that are recognized by the RAID Advisory Board. These levels are defined as follows:

Level 0: Disk stripping, writes data across multiple disks simultaneously, doesn’t have parity so is not fault tolerant.

Level 1: Disk mirroring, duplicates data across two or more disks, has a fast read time, but slow write, uses twice as much space as what is stored, is very fault tolerant.

Level 2: Combination of Levels 0 & 1, with the addition of parity checking.

Level 3: Disk striping, strips data across the drives in the disk array and uses a separate disk for writing the parity too.

Level 4: Same as Level 3, with the exception that instead of using the disks in the array, it uses the blocks of the disks.

Level 5: Is the same as Level 3 with the exception that instead of writing the parity data to a separate disk, it is also striped across the disks with the regular data.

2.2 Striped Disks

Striping data across disks means that the data is broken up into "stripes" and laid out on disks. Stripes contain the file blocks that were recently written. An example of a disk striping storage system was designed but never implemented at UC Berkeley under the name Zebra. Zebra was similar in many ways to RAID. It differed only in that the data was striped across a networked system of disks instead of onto a single host’s disks as in RAID. Zebra also used some of the concepts of a log-structure file system (LFS) in conjunction with RAID ideas in the new implementation of a file system. One of the most notable concepts borrowed from LFS was the use of the "stripe fragment," which is a stripe of data that is written to the servers that exhibits temporal locality, rather than spatial locality. The stripe fragment will contain the file blocks that had been most recently written. What was unique was that the blocks did not all have to belong to a single file, but rather could be batches of blocks from different files put together. This strategy improved the overall performance by allowing fewer, but larger reads and writes, instead of the conventional numerous small writes found in other file systems.

2.3 Log Structured file system

A Log Structure file system (LFS) is a technique for storage of data on disk that breaks from some of the traditional methods. In most traditional file systems data is kept on disk in what is known as a random-access storage structure. In an LFS all data is kept on disk sequentially in what is called a log structure, and this is the only structure that resides on a disk. Each log is broken up farther into what are called log segments. The log is a record of all the information that is contained in the log segments, along with indices into the log segments pointing to the actual data. To accommodate the fact that there needs to be large amounts of free space to write to (for new files) there is a segment cleaner. A segment cleaner goes through the log segments and looks for live data, it then collects all this data in one contiguous area, either one or many segments depending on the size of the file, unfragmenting it. All the old segments that had previously contained the old data are marked as clean, so that the segments may be reused when the disk is next written to. The use of a log structure has several advantages: it increases the speed of writes to disk, and helps in crash recovery. There is currently an implementation of this file system in use at U.C. Berkeley, along with numerous papers on its implementation.

3.0 ISS Design

Here is a discussion of some of the design considerations that were explored during the initial design of the ISS.The most important considerations are that it be scalable and completely distributable across a heterogenous environment. The approach taken is to distribute "file system" functions across several processes, any of which can run on a remote host.

It is envisioned that the application will send requests for data (images, video, sound, etc.) to a master process. The master will perform a table look-up to determine the location of each data unit. Each data unit is uniquely indexed by server, disk and offset

. In Figure 3, "ISS Server Architecture," on page 16 one can see the function that the ISS provides to a terrain visualization application (TerraVision). Here one can see TerraVison taking a path through mapped terrain, it then requests from the master the tiles that will intersect the path. The requests that the master receives are then sorted into lists on a per server basis, and these lists are then sent to the individual servers. Each server then checks whether the data is already in its cache, and if it is not, it fetches the data from disk and transfers it to the cache. Once the data is in the cache, it is sent over the network to the requesting application.

The next section describes the basic software modules, their functionality, their relation to one another in the entirety of the ISS, and some of the terms and models that are used in the design of the ISS.

In Figure 4, "ISS Components," on page 19 one can see how all the different components of the ISS interact with each other. There is an ISS master process which receives the requests, and forwards the request to an ISS server. In each ISS server there is a process which reads in the request list and passes it to the cache manger. The cache manager is in charge of communicating any requests for data to the disk readers and passing data to the process that writes data to the network for delivery to the requesting application. In order to understand how the ISS works, one needs to be familiar with some of the common models of distributed and network applications as explained below.

3.1 Client-Server Model

The ISS is patterned after what is commonly referred to as a client-server model. In the client-sever model, the term "server" refers to the program that offers a service. Servers accept requests, perform the requested service, and return the result to the requester. The program that makes a request of the server is called a "client". Any program becomes a client as soon as it makes a request to a server, and is a client program only for as long as it is waiting for a response from that server. Once the client program has received the result and continues with no other input from the server, it is no longer considered a client of that server.

An example of client-server computing is commonly found in networking. The Domain Name System is a distributed data base that contains the information matching host names to IP addresses. When a program calls a function to find the name of a host from its IP addresses (an example of this might be the pair smithers.lbl.gov and 128.3.196.76) the program is considered a client of the database server that holds all of the network name translation tables. Once the program has received the information that it requires from the name server, it is no longer a client, as it no longer depends on the results of the server.

In the case of the ISS, the client is the application process that is requesting the data from the ISS and the server program is the server processes of the ISS itself.

3.2 Distributed Server Model

In a distributed server model the services that are offered are divided up among several different hosts on a network. Furthermore, as in the case of the ISS, each of the hosts can be of a different architecture. The precise division of services depends on the particular services offered and the resources available to the server.

For the ISS, increasing the number of disk servers increases the parallelization of data transfers, hence speed is increased. In the ISS, the server is divided into two distinct sections: the name server, which does a name translation, and the disk servers. The ISS disk server fetches data from disk in response to application requests. The amount of data stored in an ISS is too great for any one disk server and an associated cluster of disks to satisfy, so the data is striped across several disk servers with each one being responsible for a separate cluster of disks.

3.3 Asynchronous Disk Access

Since the ISS stripes data across multiple disks, it needs to read multiple disks. In order to read the data at the maximum aggregate speed of the disks, all the disks need to be read in parallel. However, such parallel reading is not always possible for a single process. Asynchronous I/O operations to disk are usually supported for kernel processes but not for user level processes. While it is possible to add user-level support, modifying the kernel of every operating system on which the ISS might run on to add this support would have reduced the ISS’s portability. Instead the ISS creates a separate process for each disk. Then this process is responsible for reading data from a particular cluster of disks, leaving the problem of managing asynchronous disk operations to the kernel.

3.4 ISS Components

Below is a diagram showing how the components of the ISSs are integrated with each other. Following that is a description of each of the separate software modules and the responsibilities of each as it relates to the other modules in the ISS. Figure 4, "ISS Components," on page 19 illustrates each ISS module; the directional lines show the flow of information.

Notice that the figure illustrates a two server ISS, and that the application must multiplex the data that comes back to it. Although the data that is returned to the application has a header informing the application what the data is, there is no way for the application to know before hand from which ISS server a particular piece of data is going to arrive from, nor should it care

3.4.1 ISS Master

The ISS master is responsible for process initialization, tile name lookup (name server functionality), and coordination between servers. The master process waits for data unit request lists from the application. Upon receiving a list, the master performs a table lookup, calling the name server function to determine where the data unit is located (i.e. which server, which disk, and at what offset on the disk). Data units are sorted into per-server lists and each list is sent to the server that can satisfy those particular requests.

The master process is also responsible for receiving status messages from the servers. Each status message contains information on the current aspects of the server, such as the number of bytes read off each disk, the number of bytes written to the network, the frequency of the requests to the server, the number of data units in the servers memory cache, and the number of data units that could not be written to the network. It also contains the number of data units that could not be read from disk due to the frequency of the requests (for an explanation of what this means see section 5.1.) The status messages allow the master process to monitor the overall state of the ISS system.

3.4.2 Name Server

A name server is usually a function or set of functions that allow a one-to-one mapping of a name to an object that is associated with that name. In the ISS name server, a set of x and y coordinates and a resolution for a piece of terrain data, corresponds to the "name" with which is associated a server, disk, and disk offset specifying the location on disk of the data that matches the x and y coordinates and the resolution. This same approach is used with video data, except that the frame number is used instead of and x, y and resolution.

3.4.3 ISS Disk Servers

There is one ISS server process for each ISS disk server. It is responsible for all ISS memory buffer management on that host as well as all server inter-process communication.The server reads a list of tile requests from the ISS master (described above) and determines whether each requested tile is already in the server’s buffer cache. If a tile is already in the buffer cache (which is kept in available memory), then the request is added directly to what is called the "send" list for handling by the ISS writer module. Otherwise the request is added to a "read" list for the disk where the tile is located.

The ISS server is also in charge of sending status information periodically to the ISS master. The server reports the number of tiles sent to the application, and number of buffer cache hits, and the number of flushed tiles (see Section 5.1 for an explanation of tile flushing).

The current ISS strategy calls for the ISS master to sort tile requests from an application into lists corresponding to each server; the servers thus do not perform any sorting of the lists they receive from the master. An alternate strategy, still under consideration, would eliminate the sorting stage from the master: instead, the master would broadcast all requests to all of the servers, allowing each server to determine what, if anything, it needs to provide. In the second approach, the time wasted by the unnecessary work done by the servers is very limited compared to the time needed to read data from disk and to send it to the network. Currently, CPU utilization is low, so the extra overhead incurred by the first approach is negligible. The obvious advantage to this second approach is that it prevents the name server system from becoming a bottleneck.

3.4.4 ISS Reader

The ISS reader process reads data from one disk into the buffer cache. The reader continually checks its "read" list. If it finds a request in the list, it posts a read to its disk, and must wait for the read to complete. The ISS waits for a read to complete and doesn’t post several reads to a disk controller at one time as there is one thread per disk. When there is a thread that does only synchronous reads from disk, we are effectively getting the maximum sustained bandwidth from the disk. Posting several asychronous reads doesn’t gain any exta bandidth. Once the data is read off of the disk it is moved into both the buffer cache managed by the server and a "send" list, and the reader module starts to read from disk again, if anything is left to be read from disk.

There are two distinct "send" lists in the server, one contains high priority requests and the other contains lower priority requests. The priority of a particular request (e.g., a tile) is determined by the application that is requesting the data. An application showing a movie with a sound track, for example, might decide that all the sound should be of a higher priority and all of the pictures should be a lower priority. In this case the sound would always be sent before the pictures if they were both ready to be sent out to the application at the same time. This is modeled after the concept of Quality Of Service (QOS) that is beginning to make its way into the networking world[5]. Allowing the application to decide among different streams of data, which is more important to it, and having the ISS wait with lower priority data in favor of higher priority data, gives the application a fine grain of control over the available resources.

There is also a third priority, for data that is to be read from the disk, but not sent back to the application. The data is put in the memory and is thus more readily available for later retrieval. Such pre-fetching of data is useful for bursty programs that can predict some of the data they will need in the near future.

3.4.5 ISS Sender

The ISS sender process writes the data in the "send" list out to the network, hence to the requesting application. The sender continually checks the "send" list for high-priority data to send out to the application. If there is no high-priority data it then checks through the list looking for lower-priority data. Note that lower-priority data will only be sent out if there is no higher-priority data waiting to be sent. However, to ensure that the ISS is always sending ther are times lower proiority data might be sent before higher-priorit data. This could occur, for example if the lower priority data is in the memory cache and the higher priority data is resident on disk. In this latter case the lower-priority data is waiting to be sent, and the high-priority data has to be read off of disk.

3.5 ISS Daemon

The ISS is controlled and monitored by "issdc", a Unix-style[13] command line interface modeled after Dave Mills’ xntpd and xntpdc network time programs [18]. Commands to control and to request status information of the ISS system are sent using "issdc". Using issdc a user can restart or reset the ISS or load a new data set. The user can also request status and statistics from the ISS system. The issdc is also very useful for monitoring the condition of different ISS servers. Using issdc one can remotely debug an operating an ISS across the network.

3.6 ISS to Application IPC

In designing the ISS to provide some "file system" functionality, one of the main goals was to mimic the statelessness found in some file servers. Statelessness implies that the current state of the ISS does not depend on the state of the application driving it. If the application that is using the ISS crashes, or if the network connection fails, or if any other service is interrupted, the ISS will detect this interruption and wait for the connection to be reestablished. On reconnect, service can continue as if the interruption had never taken place, since the ISS, in essence, doesn’t care about the interruption.

The application can request the ISS to flush the cache of all data. This feature is useful if the application has decided to move to a new data set (e.g. new geographical location). The ISS also allows several data sets to be stored simultaneously. At any time, the application can begin to request data from a different data set. The ISS is unaware of the change, in as much as all requests require the same type of work to be performed.

3.7 ISS Initialization

The ISS must load onto the disks the data that is to be used from the distributed servers beforehand. This pre-loading is done through the use of several support programs: a loader, and scribes. In Figure 5, "ISS Initialization," on page 25, one can see how the various components interact with each other. The application in the figure asks the ISS Master to load a particular data set. The ISS Master in turn requests of the loader program to contact the MSS (Mass Storage Server) and obtain the data for the application. As the data is transferred off of the MSS it is transferred to the scribes, who write it to the disks of the various servers. For a full explanation of each of the components involved in the pre-loading of the data see sections 6.2 - 6.4 later in this thesis.

3.8 Network Infrastructure

The ISS was designed to be most effective on top of a high bandwidth, low latency network, such as ATM. The ISS is currently implemented in several different networks that meet these requirements. One of the networks is a wide area ATM network known as MAGIC that covers many hundreds of miles. The ISS has also been run over FDDI and Ethernet LANs. ATM networks use switches to multiplex data streams from other ISS sources.

Gateways or routers allow the ISS to operate over combinations of ATM, FDDI or Ethernet. In Figure 6, "Data Streams Aggregated by ATM Switches," on page 27 one can see the general idea of how multiple disks and servers would all act simultaneously to deliver the data. This topology allows for a large collection of disks to seek in parallel, and all servers to send the resulting data to the application in parallel, enabling the ISS to perform as a high speed image server. It is this parallelism that the ISS was designed to exploit.

4.0Current Network, Disk & Workstation Technology

Here follows an overview of some of the workstation and network technology that is currently available, as well as an examination of how it might meet the needs of the Image Server System. An exhaustive survey of contemporary hardware is beyond the scope of this paper, but presenting some of the pros and cons associated with these technologies should give some insight into the choices of hardware for the initial implementation of the ISS.

4.1 High Speed Networks

The following sections describe some of the fastest data transfer methods that are available today in the networking world. It examines and gives some background on the following: HIPPI, FDDI and ATM. This section will also examine SONET, which can carry ATM cells, or FDDI traffic. These were all examined and evaluated before starting work on the ISS, so that a baseline expectation could be made as to what kind of bandwidth could be expected from an application such as the ISS.

4.1.1 HIPPI

High-Performance Parallel Interface (HIPPI) is perhaps best described by David Clark at the Gigabit Testbed workshop in Washington D. C. when he stated, "I like to think of HIPPI as an 800 Mbits/s RS-232". HIPPI is a set of standards produced by the ANSI X3T9.3 committee that describe the electrical, mechanical and signaling characteristics of an 800 or 1600 Mbits/sec, word-parallel, dual simplex, point-to-point link. This link is achieved over copper cables that can be up to 25 meters in length. HIPPI’s main advantage over all other current forms of data transfer over copper is that it transfers the data in parallel, 32 or 64 bits at a time, which accounts for its high speed. At 800 Mbits/sec the data is sent in 32 bit parallel and at 1600 Mbits/sec the data is transferred in 64 bit parallel bursts[3]. HIPPI’s main disadvantage is that it is limited by the length of the cables (only 25 meters). It is also possible to extend HIPPI through the use of fiber extensions which allow it to be used at a range of up to 10 kilometers.

One of the reasons that I mention HIPPI at all, since it is at best a restricted networking protocol, is that it was the first standard way to connect devices at high data rates and can be used in conjunction with a switch to provide a gigabit LAN service [35].

4.1.2 FDDI

FDDI (Fiber Distributed Data Interface) is a fairly new standard for a network technology. It is based on fiber optics as the transport medium. FDDI is a 100 Mbps Local Area Network (LAN). FDDI has several advantages that make it a good platform to use as a backbone to Ethernet (10 Mbps) LANs: it is a token ring, has a maximum length of 200 km, can accommodate up to 1000 connections, and is deterministic.

FDDI is a token ring based system. Each host has two streams connected to it, an upstream where it sends the data, and a down steam where it receives data. A token circulates around the ring: when a node sees the token, it has the right to capture the token, substituting its own data in the token’s place. After holding the token for up to a specified maximum amount of time, the token is put back onto the ring and the next node has the opportunity to send data. As the data goes around the ring, each host’s interface looks at the data as it comes in and copies it to the outgoing stream. At the same time it scans the data for the address: if the address of the data matches the address of the host, the host keeps a copy of the data locally and as the host copies the data to the upstream it will set a copied flag in the data so that when the data completes its way around the ring and reaches the host that sent it, that host will know it has been received.

As I have already stated, an FDDI network can span up to 200 km, accommodate up to 1000 hosts, and is deterministic. Each of these factors is interleaved with the others. The specifications for an FDDI network require a maximum token latency of 1.617 ms. This maximum latency makes FDDI deterministic, and defines most of the elements of an FDDI network. Each host is guaranteed a chance to use the ring, unlike Ethernet which employs a first-come-first-serve algorithm (CSMD/CD)[30]. The timing also affects the number of hosts and the length of the network. Since a token can travel on an FDDI line at a speed of 5085 km/ns (speed of light in glass), one sees that a latency of 1.017 ms is introduced for a length of 200km. The time it takes each host to transfer a message through its interface, which is 600 ns, multiplied by 1000 host interfaces introduces an additional latency of 0.60 ms. This is a total latency of 1.617 ms, exactly what the standard specifies for an FDDI network. Of course this is a contrived example to show how the standard for FDDI was created, but it does indicate some of the limits involved in any large network. The speed of light, for example, now begins to play an important part in the construction of the next generation of networks. The networking community must reexamine the original premises made for networks to see if there are any hidden constraints that could have been overlooked in previous conceptions of how networks were to be created.

4.1.3 ATM

Asynchronous Transfer Mode (ATM) is a packet switching and multiplexing technology used to transmit data. It combines multiplexing, cross-connecting and switching information into 53 byte cells (48 bytes of user data, plus five bytes of overhead). ATM can operate over various physical media, include SONET, SDH, and DCS. ATM is what is considered a packet switching system. It converts a message into segments before sending it out over the network, and then recombines the segments into the correct order at its destination. But unlike conventional packet switching, ATM uses small fixed-length cells. This use of fixed length cells instead of the commonly used variable length improves switching speed since all packets can be switched in hardware.

To accommodate the needs of different ATM users, on top ATM lie what are called Adaptation Layers whose functions depend on the desired service. For example, voice requires a constant bit rate, so a voice application would ask the Adaptation Layer for a VCI (virtual circuit identification) with a constant bit rate and the adaptation layer would take care of the details. Another application might need to send large packets of data over the link. Since an ATM cell is so small, the data must be fragmented, transmitted, and reassembled. The adaptation layer would handle these chores, taking care of segmenting the data, filling in the cells, and checking the data on the other side of the connection for lost or corrupted cells.

4.1.4 SONET

SONET, which stands for Synchronous Optical Network, is an optical interface standard. This new standard covers data rates ranging from 51.84 Mbps to 13.27 Gbps in increments of 51.84 Mbps. These values are dictated by the underlaying optical carrier (OC). Several levels of OC have currently been defined:

OC-1 (51.840 Mbps)

OC-3 (155.520 Mbps)

OC-9 (466.560 Mbps)

OC-12 (622.080 Mbps)

OC-18 (933.120 Mbps)

OC-24 (1244.160 Mbps)

OC-36 (1866.240 Mbps)

OC-48 (2488.320 Mbps)

Unlike other optical interface standards, the SONET standard allows different data streams to be combined, thereby increasing bandwidth. SONET has a special framing format to support multiple transmission "flows" across the network. These flows can carry ATM cells across the network. Associated frames are called streams, and each frame has a header to describe the contents and a data section. These streams are defined through an electrical equivalent to the OC, called a Synchronous Transport Signal (STS). Each STS frame is a 125-microsecond signal that contains 810 bytes. These bytes can logically be viewed as a matrix of 90 columns and 9 rows. It is illustrated as rows and columns so that one can logically view the first three columns as containing the header for the packet. This header is important because it contains "pointers" to data that reside in the rest of the frame.

4.2 High Speed Storage

The following sections explore various methods of increasing the data flow from disk to computer. Since one of the primary purposes of the ISS to move data off disk and onto the network as quickly as possible, this is a very important aspect to explore.

4.2.1 Hard Disk Technology

Hard disks currently are the most common medium for storage of data. The current technology can realistically read data off of a disk at about 3 MBytes/sec, although vender specifications often claim that higher speeds are possible. Most of the equipment that was used in the initial development of the ISS did not meet the manufactures specifications, and numerous bugs were discovered in the hardware and software along with incompatibilities with many of the existing networking drivers that were currently used. Through constant contact with the various venders of the hard disks all of the problems encountered were resolved.

It should be noted that the manufactures spec for a disk’s speed is often obtained using special hardware that just reads a disk, which gives unrealistic maximum throughput. However, with the ISS the disk speed is measured while it is running on a workstation in conjunction with other process and network activity. This greatly reduces the performance of the disk.

4.2.2 Small Computer System Interfaces (SCSI)

SCSI (Small Computer Systems Interface) is a standard hardware interface and command set for connecting peripherals to a computer. The SCSI standard can be divided into SCSI (SCSI 1) and SCSI2 (SCSI wide and SCSI wide and fast). SCSI 2 is the most recent version of the SCSI command specification and allows for scanners, hard disk drives, CD-ROM players, tapes, and many other devices to connect. SCSI is becoming a popular standard, with more and more computers using it.

4.2.3 Fast SCSI

The addition of "Fast SCSI" allowed a 10 MHz transfer rate, which came out of a joint effort with the IPI (Intelligent Peripheral Interface) committee in ASC X3T9.3. Fast SCSI achieves 10 Megabytes/second using 8bit transfers and with wider data paths of 16- and 32-bits can achieve up to 40 Megabytes/second. By the time the market starts demanding 40 Megabytes/second, it is likely that the effort to serialize the physical interface for SCSI-3 will cause high-performance SCSI users to examine the existing Fiber Channel (25 Mbytes/sec to 130 Mbytes/sec)

At this time the fast SCSI parameters cannot be met by the Single Ended electrical cable drivers, and is only suitable for Differential drivers. One of the goals in SCSI-3 is to identify the improvements needed to achieve 10 MHz operation with Single Ended components.

4.2.4 Differential SCSI

"Normal" SCSI is also called "Single-ended" SCSI. For each signal that needs to be sent across the bus, there is only one wire to carry it. With differential SCSI, for each signal that needs to be sent across the bus, there is a pair of wires to carry it. The first wire carries the same type of signal the single-ended SCSI carries. The second in this pair, however, carries its logical inversion. The receiver takes the difference of the pair (thus the name "differential"), which makes the connection less susceptible to noise and allows for a greater cable length.

4.2.5 Wide SCSI

Wide SCSI added several features to SCSI. SCSI may now transfer data at bus widths of 16 and 32 bits. Commands, status, messages and arbitration are still 8 bits, and the B-Cable has 68 pins for data bits. Cabling was a confusing issue in the closing days of SCSI-2, because the first project of SCSI-3 was the definition of a 16- bit wide P-Cable which supported 16-bit arbitration as well as 16-bit data transfers. Although SCSI-2 does not contain a definition of the P-Cable, it is quite possible that within the year, the P-Cable will be the most popular non-SCSI-2 feature on SCSI-2 products.

4.2.6 RAID

The designers of RAID (Redundant Array of Inexpensive Disks) postulated that many slow disks can match the performance of a high speed disk through the duplication of data across all the disks and the reading of that data back in parallel. No matter how slow the disks are you can still achieve excellent performance. With some of the newest technology RAID is able to provide 40 MB/sec sustained, with 40GB storage capacity. RAID in its commercial form is now available for almost all types of computers, from supercomputers, to personal computers. With the newest additions to the SCSI command set it is now possible through software to create one’s own RAID configuration. All that is required are several chained SCSI disks onto a SCSI host adaptor and the correct driver software. Once the disks are connected to the SCSI, the software directs the host adapter to treat all the disks as one logical unit. A user can configure the software to work in one of several different ways, allowing full control over the amount of data duplication across disks. The user can also configure the total size of the RAID configuration. If there is not enough room on the disks to duplicate all the data the user can configure the software to treat the unit as a smaller disk, or vice-versa.

4.3 Hardware Platforms Issues

There were several different computer architectures that were available for the development of the ISS. Here I examine some of the hardware considerations that affected the choice of the final platform that was used to develop the ISS are examined.

4.3.1 Multiple CPUs

It might seem logical that there would be a linear increase in the speed of your program in direct proportion to the number of processors that a host contains, but this is not always true. In reality there is more processor time for each process in the system, which gives the illusion of increased speed as your process gets more CPU time. In Figure 17, "Memory Behavior," on page 80, one can see the impact of multiple process accessing memory "simultaneously" as a function of the number of processors. Once the number of processes accessing memory exceeds the number of processors available, performance degrades. The graphical analysis indicates that for the platforms tested, that each processor has relatively independent access to memory, up until the aggregate throughput exceeds the bandwidth of the memory subsystem. Further, a single process is always able to use the memory access potential of a single processor; multiple processes never increase the memory throughput on a single processor.

In the current implementation of the ISS, there is a separate process for each software module; one for each disk reader, one for the cache manager, one for the module responsible for sending the data over the network. An ISS host with the ability for using several CPU’s simultaneously can enhance the memory limited performance of the ISS. The enhancement caused by the increase in CPUs is still bound by the fact that a critical resource is being shared by all of these processes. The resource in this case is memory for the cache of tiles. Thus there must be a semaphore lock on this resource, and semaphore locking and unlocking cause inefficiencies still to be overcome. Redesign of the ISS to use threads is helping reduce the impact of these problems. A more thorough description of threads and their possible uses are examined elsewhere in this paper.

4.3.2 Interleaved Memory

Interleaved memory is a method to access memory in parallel. This section will discuss non-interleaved ("normal") and interleaved memory and the benefits of the latter. Also shown will be the speeds of different systems using different configurations of interleaved memory.

4.3.3 Non-interleaved memory

In a non-interleaved memory system, the memory is presented in N banks of memory. When addressing the memory all of the previous bank of memory is addressed before the next bank of memroy. Figure 7, "Non-interleaved memory layout," on page 39 shows this organization for two banks of N long words. (A long word is 4 bytes, or 32 bits, and is the natural unit of memory for most of the computer systems that are examined this paper).

Most computer systems perform burst accesses, single bus transactions that read or write 16 bytes in 4 adjacent long words, to move data between their caches and memory. All 16 bytes come from one bank of DRAM in a non-interleaved memory system, so the time required to complete the transfer depends directly on the access time of the DRAM.

4.3.4 Interleaved Memory System

In an interleaved memory system, there are still two physical banks of DRAM, but logically the system sees one bank of memory that is twice as wide. In the interleaved bank, the first long word of bank 0 is followed by the first long word of bank 1, which is followed by the second long word of bank 0, and so on. Figure 8, "Interleaved Memory Layout," on page 40 shows this organization for two physical banks of N long words. All even long words of the logical bank are located in physical bank 0 and all odd long words are located in physical bank 1.

The interleaved memory configuration is designed to speed the burst accesses by as much as 30% on some systems [1]. The actual improvement depends on the system clock speed and the DRAM access time. Since the four long words of a burst access are spread across two physical banks of DRAM, the access can be overlapped to hide part, or most all of the DRAM access time delay.

This methodology can be expanded to more than the two-way interleaved memory shown in this example. Examples of three-way or four-way interleaved memory are common in today workstations.

4.4 Operating Systems Issues

There are several types of operating systems that were considered as possible development environments for the original work on the ISS. On several of the platforms that are discussed in this paper, real time operating systems are available and were considered as a possible base system for the ISS. One of the main reasons that a real time operating system was not chosen was that such a system is specialized, expensive, and can be difficult to program on, depending on the application. Had the ISS been built to run on a real time system, the job of porting it to heterogenous platforms would have been a monumental task. As the ISS is currently implemented, it has been ported to several different platforms with a minimal amount of trouble. The operating systems that are currently supported are the Sun Solaris 2.4, Dec OSF/1.x, and SGI IRIX 5.1.1.

4.4.1 Multi-Threaded Architecture

A traditional UNIX process has a single thread of control. What a multi-threaded architecture provides is the ability to have more than one thread of control in a single process (single address space) that executes independently within a process. In general, the number of threads, or identities, that an application applies to a problem are invisible from outside the process. Threads can be viewed as execution resources that may be applied to solving the problem at hand. The paradigm that is being exploited in this is that the more resources that are applied to a problem, the faster the problem is solved. The following is a description of POSIX standard threads as implemented by the major vendors, Dec, and SUN, but can be applied to most thread models.

Threads share the process instruction (or text) area and most of its data. A change in this shared data by one thread can be seen by the other threads in the process. There is often the mistaken idea that everything is shared between thread, but each thread can also have private data that can’t be viewed by other threads. Threads also share most of the operating system state of a process. For example, if one thread opens a file another thread can read it. There is no system-enforced protection between threads.

There are a variety of synchronization facilities to allow threads to cooperate in accessing shared data. The synchronization facilities include mutex locks to insure mutual exclusion, condition variables, semaphores, and readers/writers (multiple readers, single writer) locks. The synchronization variables are allocated by the application in ordinary memory. Threads in different processes can also synchronize with each other via either synchronization variables placed in the Operating Systems shared memory or mapped files, even though the threads in different process are generally invisible to each other. Synchronization variables may also have different variants that can, for example, have different blocking behavior even though the same synchronization semantic is maintained.

Each thread has its own signal mask. This permits a thread to block asynchronously generated signals while it accesses a state that is also modified by a signal handler. Signals that are synchronously generated are sent to the thread that caused them. A signal that is externally generated is sent to one of the threads within the process that has that signal unmasked. If all threads mask a signal, it is set pending in the process until a thread unmasks that signal. Signals can also be sent to particular threads within the same process.

A multithreaded process can fork in one of two ways. The first way clones the entire process and all of its threads. The second way only reproduces the calling thread in the new process. This last method is useful when the thread is simply going to call exec().

4.4.2 Real Time Systems (scheduling)

A real time system is defined as "a system that performs its functions and responds to external, asynchronous events within a predictable (or deterministic) amount of time" [8]. This is the type of response time that the ISS requires. Such real-time systems were until very recently specialized, expensive, or specific to a single platform. However, better hardware and new operating systems have changed "real time" from just a buzzword or a luxury to a practical reality on many platforms.

An application like the Image Server System could benefit greatly from a real time operating system. The ISS is driven by external events (the requests for the images) and must deliver the images back to the driving application within a pre-determined time (before the next request, because of the ISS flushing algorithm). These requests help match the description of what a real time system provides. The ISS would also benefit from the scheduler in a real time system. The ISS, by its very nature, must at certain times resolve conflicts with itself. For example, one ISS component may be trying to put an image into the send list, while another component is trying to see if there is anything in the list to send out. In most conventional operating systems such actions require sequential access to the list, but in a real time system the part of the program waiting to send out the image can just be "woken up" when some new data has been put in the list to be sent out.

4.4.3 TCP/IP Window Size

A TCP window is the size of the buffer space available on the receiver end of a single TCP connection. The larger the buffer space, the more packets it can accept before the host has to process them or tell the sending application to slow down. This buffer size also determines the number of packets that can be in transit in the network. The transmitter has to hold a copy of all data being transmitted until such time as it receives an acknowledgment that the data was received by the other side (receiving host).

There is a direct correlation between the maximum speed that can be attained on a network and the size of the window: a bigger window implies higher speed. Of course, increasing throughput is never as easy as just changing the size of the window, else it would have been done long ago. There are several problems with the current implementation of TCP and window sizes.

The first problem is how TCP specifies window sizes. The window size in a TCP header is represented in 16 bits, which means that the window size can only be 2**16, or 65536, bytes (or 65k.) That may sound like a lot of data to be in transit, and it was in the early days of networking. Today, with the advent of gigabit networks, this is not enough, especially when the distances involved in these new high speed networks are taken into consideration. Greater distances allow for more data to be in transit. A one megabit, coast to coast network can effectively buffer (has "in transit") N bytes of data. A 622 Megabit/s (OC-12) network can have 622 x N bytes in transit (i.e. having left the transmitter and not been received).

Van Jacobson has recently helped to solve this problem by adding options to the negotiation part of a TCP connection to advertise a larger window size[11]. This is done using the previous 16 bit field in the header to specify the window size in a new format, that of log base 2, which allows for a much larger window size.

Of course with all this data in transit, the possibility that some of it may be corrupted in some way is much more likely than before. TCP recovers from such errors, but there is now the added problem that there is not a small number of outstanding packets, but a very large number of them, which brings us to the second problem.

The second problem (most prevalent with the advent of gigabit networks) with performance of a gigabit network is based on the product of the transfer rate and the "round trip time" or RTT. This product gives the amount of data that can be outstanding. Van Jacobson refers to gigabit networks in his RFC 1323, and states that the RTT of such a network would be 17 seconds. A 16 bit window header divided by such a small RTT would be very prone to what is known as "sequence number wrap", where there are so many packets in transit that the packet sequence numbers start over again. When two packest have the same sequence number TCP has to stop the transmitting and receive all of the outstanding packets. The it must restart the entire transmission again. This stoping and restarting in turn lowers the overall performance of TCP.

Sequence number wrap has also been solved by a method called PAWS (Protection Against Wrapped Sequence Numbers), again developed by Van Jacobson. I refer the reader to RFC 1323 for a complete description of the implementation of PAWS, the details of which are beyond the scope of this paper. It should be noted that at this time most venders don’t implement this. As networks become faster and longer, this is going to be a problem.

5.0 ISS Implementation Details

This section will examine how the ISS was implemented to avoid the various problems that have already been discussed in the preceding sections, and how it can still meet the requirements enforced by the nature of what the application is, a multi-media server. Also examined will be how the ISS was implemented using the current generation of hardware.

5.1 ISS server buffer management

The buffer management process has a threefold task. It must act as a cache of recently used tiles: all requests for data off-disk are cached in the buffer for possible use later on. It must manage the memory buffer for the data being transferred from disk. Finally, it has to manage the memory buffer for the data being transferred to the network interfaces. The functionality of the buffer management is very similar to that of the UNIX operating system, and many of the ideas for lists, hashing, and the format of the headers were adopted from that system for use with the ISS.

Figure 11, "Buffer header for internal representation of data," shows the format of a header structure as it is represented inside the ISS.

The buffer has two main parts. The buffer header contains information used to find the buffer and describe its contents. The content information includes the name (as used by the name server for lookup), disk, offset, and other information about exactly what will be in the data portion. It also includes the links that connect it to the various lists in which the buffer will reside during its lifetime. A buffer that is in the hash table list can at the same time be in the list to write it out, the pointers for these lists are contained in the header. Also contained in this header is timing information: when the request arrived, when it was looked up in the cache, when it was read in from disk, and how long it took to read it from disk. The flags contain status information about the current state of the buffer: is the buffer being sent to the network?: is the buffer being filled in from disk?

The second part of the buffer contains a pointer to the data that was read from the disk (if the buffer was valid). In the current implementation the buffer size is fixed in the system, but it would be easy to make this adjustable for different sized blocks of data.

A buffer can only be freed from the hash table in two ways. The first way is that the buffer is allocated to a list (read/write) and that list is flushed. In this case the buffer is returned to the free list (actually it is put at the head of the LRU list so that it is the next block to be reused). The second way is for a buffer to progress all the way through the LRU list until it has reached the end of its life time, when it is recycled.

Figure 11, "Snapshot of buffer pool with 2 read list (2 disks)," on page 51 shows an example how a buffer might look were there only two disks and a very small amount of memory allocated to the cache. All valid buffers are in the hash list, and the hash table is used to quickly locate a buffer if it is in the cache; otherwise a new tile is aged out of the LRU list for use.

There are several lists that run through the buffer pool. The free list consists of the tiles in the memory pool that haven’t been filled in. This list encompasses the entire memory pool at start-up time, but since the ISS tries to buffer as much data as can possibly fit into memory, the free list is empty as soon as all the memory buffers have been filled.

The LRU list traverses all of the memory buffers, taking the next block of memory to be freed from the head of this list. After the block has been filled in with data from the disk the block is returned to the tail of this list. Any time a block of memory is touched, it is returned to the tail of the list.

The send list contains all of the valid buffers that are ready to be sent out to the network. After a buffer has been sent out, it is removed from this list, but still remains in the hash table in case it is re-requested before the LRU algorithm can age it out.

The read list is composed of blocks taken from the LRU list and filled with tile header information (name, disk, offset, etc.). There are several of these "read lists" running through the memory pool simultaneously, but a buffer can only be in one of these lists at any time. Each read list is associated with a single disk.

5.2 Tile Requests, TerraVision to ISS

The ISS handles three levels of requests dubbed priority request, prediction request, and fetch request.

Priority requests should be satisfied before requests at other levels. Priority requests also have an implicit order within a single request list: if a request list contains three priority requests, the first is rated higher then the second and the second higher then the third.

A prediction request is produced by a prediction function in the application. In the case of video it would just be frames several minutes or seconds in advance of what was currently showing on the screen, depending on available memory. In the case of the Terra Vision terrain visualization application, a predictor module makes requests for terrain images (tiles) well in advance of what is being shown on the screen, but in the same direction as the visualization is traveling. In the case of a movie, the prediction function could be just a linear list of upcoming frames. From the standpoint of the ISS, prediction requests are identical to priority requests except for their relative urgency. If there is a priority request waiting either to be sent out or to be read off of disk, that request always has priority over prediction requests. Prediction requests are used to keep the memory buffers full at the requesting application, to keep the network bandwidth in full use, and to keep the disks busy with tiles that have a high likelihood of being needed.

A fetch request asks the ISS to read the data off of disk and store it in the ISS’s memory cache for possible later use. The reason for this will depend on the application. The application may know that it is going to use the data, but doesn’t have room to store it; the application may "predict" that the data will be used in the future, but in case the prediction is wrong the data will stay on the ISS until such time as it is needed. Data that is stored in the ISS’s memory cache is accessed about five times faster than data on disk. Thus, it benefits the application requesting data to help predict its own usage of that data.

TerraVision uses a path prediction algorithm to predict what tiles will be needed in the near future, and assembles these tiles into request lists for each of these three types of tiles. By always requesting more tiles than the ISS can actually deliver before the next tile request, Terra Vision ensures that no component of the ISS is ever idle. For example, if most of a request list’s priority tiles were on one server, the other servers could still be reading and sending tiles that may be needed in the future instead of being idle.

5.2.1 The Path Through the ISS

The ISS is really two very different programs. How they share the task of finding and getting data off of disk is what makes them so unique. The first program is the ISS Master. Its purpose is to receive requests from the application and look them up in a hash table to find the server, disk, and offset of the request. It then forwards a list of requests to the server.

The other program is the ISS Server. The ISS Server receives a list of requests from the ISS Master. The first thing the ISS Server does upon receipt of a list of requests is to flush all requests from the last previous request list: all requests still in the read or write lists are removed from those lists and any memory buffers that have been filled in from disk and are waiting to be sent out are returned to the hash table. Those buffers that were waiting to be filled with data from disk are put at the head of the LRU list so that the header may be filled with the requests of the newly arrived list.

The newly arrived list of requests is checked against the buffers that are in the hash table. Those requests that already reside in the hash table are put directly into the "send" list. The requests that aren’t in the hash table are put into the "read" list.

The read processes go through the read list for the disk for which they are responsible seeing if there are any requests for data off that disk. If there is such a request it is removed from the "read" list, loaded with data, read off disk, put into the send list, (placement in the list is dependent on the tile’s priority), and added to the hash table as a valid buffer of data. The process in charge of sending data out over the network interfaces is constantly checking to see if there are any buffers in the "send" list. This transmitter process removes buffers from that list, marks the buffers as having been sent to the network, and sends them to the network interface.

5.3 Tile Placement Algorithm

For the ISS to perform optimally, it is necessary to determine a strategy for the layout of data on the disks, taking into account all of the relevant parameters (disk configuration, disk performance, server performance) that will give maximum parallelism for each data request.

For tiled images the ISS uses a method developed by Chen and Rotem of LBL’s Data Management Group, described in [6]. The main idea is to decluster tiles so that all servers and disks are evenly accessed by tile requests. A distance vector-based declustering method distributes data among the servers. Within a disk, however, it is desirable to cluster the tiles such that tiles near each other in 2-D space are close to each other on disk, thus minimizing disk seeks. The clustering method used here is based on the space filling Hilbert Curve. The main objective of the clustering method is to ensure that tiles assigned to the same disk are as far apart as possible on the image plane. This tends to minimize the chance that the same disk or server will be accessed many times by a single tile request list.

Tiles are distributed among K disks by first determining a pair of integer component vectors which span a parallelogram of area K. Tiles assigned to the same disk are separated by integer multiples of these vectors. Chen devised a fast method which finds the best pair of vectors satisfying the above principles. The method is general for any number of disks and in most cases is within 7% of the theoretical optimal placement, which does not restrict vector components to be integers.

The guiding principle for clustering is to place tiles which are likely to be requested together (i.e. tiles near each other on the 2-D image plane) as close as possible on the disk. Doing so will tend to minimize expected disk seeks. The algorithm builds a minimal space filling Hilbert Curve on the plane which covers all the 0-points (see [12]). The Hilbert Curve is used because it has been shown to be the best curve that preserves the 2-D locality of points in a 1-D traversal.

5.4 UNIX Operating System Issues

During the implementation of the ISS there were several discoveries made about how the UNIX operating system can diminish the expected performance of the workstation. The following comments are specific to a Sun SparcStation 10 running SunOS 4.1.3, but may apply to other UNIX workstations as well. The new versions of UNIX such as Sun’s Solaris 2.x and SGI’s Irix 5.x address many of these issues. The main concerns are:

semaphore un/lock time

context switch and sleep scheduler time

These issues both involve maximizing the use of the CPU at any given time. As the ISS is I/O bound, it should not be active in the CPU while reading or writing, yet when any part of the ISS does require the CPU, it should not have to wait. For example, after an ISS sender process is finished sending (writing) a tile it should be able to tell the scheduler it’s done, which will allow the next process to access the CPU. The scheduler should not switch to another process until the ISS sender’s time slice has expired.

One attempted solution to the problem was to put the process to "sleep" via the UNIX system sleep() command, hoping that the scheduler would notice that the process was in an "idle" state and would context-switch to the next available process. However, the process that was put to sleep continued to hold the CPU for the remainder of its time slice. Since the time-slice allowed to each process is preset by the system and cannot be changed, the scheduler wasted a relatively large amount of CPU time.

An associated problem is that of context switch time. One of the reasons that the time-slice for all processes is set to a relatively large value is that the amount of time needed to switch between processes is relatively large. This time to switch between processes is commonly known as "context-switch time" and the switching of processes is called a "context-switch". In a context switch, the state of the currently running process is saved and replaced by the previously saved state of some other process.

The system schedules processes according to a "round-robin" schedule. A round-robin schedule is a simple queue, where the last process to run is at the back of the queue, and the process that hasn’t run in the longest time, and is ready to run, is at the front of the queue. Thus every process gets a chance to run. The problem with this scheme is that the system must move a process into the CPU to see if it is ready to run. Idle processes, which are not ready to run, incur this overhead.

Most of the problems described here have been addressed by Solaris 2.2, with its "real-time" hooks for scheduling and its improvements in the System V shared memory operations, especially those involving semaphore un/locks.

5.4.1 RTP as a Transport Protocol

We believe that with this class of application the extra overhead of the TCP protocol, particularly its retransmission mechanism, may be unnecessary and undesirable. To obtain reasonable performance using TCP in an ATM WAN one must set large TCP buffers (on the order of 200 KBytes). Not all UNIX kernels support buffers of this size, and even if they did so, buffers this large take a lot of system resources. It has been demonstrated that TCP does not appear to be a "fair" algorithm in this environment, as predicted by Jacobson [11]. In visualization-like applications it is therefore our belief that retransmission should be left up to the application, and that a datagram-based protocol like Real Time Protocol (RTP) [25] should be used.

An example of an application which can better determine whether or not retransmission is necessary is the MAGIC terrain visualization application. With TerraVision, the Image Server System (ISS) responds to priority-ordered, time-sensitive requests for sub-image tiles. Each request list arriving at the servers is self-contained, i.e., it represents the total, current need for image tiles. Therefore a list that contains a request for a tile previously requested means that the tile was not delivered and, if a tile currently waiting to be delivered is not in the current list, then it is no longer needed. In particular, a tile that has been sent, and does not show up in the next request list, is no longer needed regardless of whether it was actually received by the application. (It might have been lost between the ISS and the application.) This situation arises when a "fly-through" visualization has passed the virtual landscape location that contains the lost tile.

The ISS supports multiple transport protocols for data delivery. For completely reliable delivery, the Transmission Control Protocol (TCP) is used. When unreliable delivery is acceptable, the User Datagram Protocol (UDP) can be used, and is usually about 30 percent faster than TCP. However, UDP and "standard" RTP loses too many packets because of insufficient input buffers in the receiving OS and the low priority in the input queues processing on the receiver.

Lawrence Berkeley National Laboratory has developed an extension to RTP (called "RTP+") which includes a simple form of flow control. RTP+ data units ("frames") are transmitted based on a timer, and certain frames are ACKed, where the ACK carries information about packet loss. Lost packets result in increasing the inter-RTP+ packet send time, and this results in RTP+ slowing down until packets are no longer lost. This protocol avoids the overhead of retransmissions, tells the application which packets are lost, and allows the network to apply "back pressure" (effectively a request to the sender to slow down because it is congestion the network) by slowing transmission in the face of packet loss. Some data, such as request lists and control data, require reliable TCP connections. For the image data connection TCP, UDP, and RTP are all supported. The general issue of which transport protocol works best is still under active study.

5.4.2 Controlling Network Delay

Most implementations of TCP in today’s networks use an algorithm invented by Nagle [16]. Nagle wanted to avoid network congestion. Nagle noticed that most packets on networks are small, and since there is a fixed overhead for every packet that goes out onto the network the available bandwidth of the network is not well-utilized. An example of this behavior would be single characters typed on a keyboard on one computer and sent character by character to another computer, as in the case of remote logins. Nagle came up with an algorithm to delay for a small amount of time to see if there was any more data being sent to the same place: if there was, it could be added to the same packet, thereby saving the overhead of a new header. This delay in sending out a packet over the network is just the opposite of what is needed in a "real time" application that will be sending out many small packets (tile requests) over the network. This delay mechanism is embedded in the socket code and is turned on by default; it can only be disabled by setting certain socket options on a socket. This option (TCP_NODELAY) should be noted by anyone who wishes to implement any sort of "Real Time" application over a network.

5.4.3 Semaphores And Shared Memory

Semaphores are used in the ISS to ensure the integrity of each of the "lists" in the application. These lists keep track of what is currently in each server’s "buffer pool", what is waiting to be read from each disk, and what is waiting to be sent to the network. Due to the significant overhead incurred by using semaphores, the application had to be coded so as to minimize the number of semaphore un/locks, yet also to ensure that there were enough semaphores so that no part of the application was unnecessarily blocked by any other. This became quite a challenge, as the number of semaphores is based on the number of lists, which increases with the number of disks being used.

6.0 ISS Support Programs/Systems

This section focuses on the different programs that were created to help develop and debug the ISS. There are several aspects of ISS-related functionality that need to be examined. First, an application simulator is needed to simulate the actual types of requests that the real application would be making. A loader program places the tiles on the correct servers and disks according to the tile-placement algorithm, and creates a lookup table that can be used by the ISS to locate those tiles. Since the data must be stored before being loaded onto the ISS, a Mass Storage System (MSS) was required. In order to do sufficiently accurate timing to characterize the detailed performance of the overall system we needed time stamps for all the different aspects of the ISS that were accurate to within 1 ms. To accomplish this all hosts were synchronized with each other via the Network Timing Protocol(NTP)[19] and time sources based on the Global Positioning Satellite (GPS) system.

6.1 Application Simulator

The TerraVision Application Simulator ("terravis") generates tile requests in the same format as TerraVision and sends them to the ISS master. Terravis tests the performance of the ISS under a variety of possible conditions by allowing the user not only to vary the number of frames per second and number of individual tile requests per frame sent to the ISS, but also to introduce an element of burstyness into the requests stream. Terravis also monitors the performance of the ISS by keeping a running log of the time required to accomplish various tasks in the course of satisfying each tile request.

Terravis consists of three separate modules. The sender module formats and sends the frames of tile requests. The receiver module waits for satisfied tile requests, including data, from the ISS. The master module initiates contact with the ISS, launches the sender and receiver, and coordinates the communications with the ISS. Not only do separate modules enable saturation of the ISS and of the network, but they also allow for the simulator to be altered to send and to receive many different formats of data under various conditions. The simulator has the potential to serve as an intermediary between the ISS and a requesting application, translating between the application’s preferred mode of referencing data and a "standard" format that the ISS would expect.

Terravis can also be used to "play back" a sequence of requests that happen during a session with a real application. Built into the ISS is the capability of capturing a session with any application and recording it to a logfile. This logfile can then be "fed" into terravis and it will duplicate the same sequence of requests. In addition to preserving the exact sequence, terravis is also capable of duplicating the exact timing with which the sequence of requests was delivered to the ISS. It does this by using time stamps that are stored in the logfile with the actual requests that were sent to the ISS.

6.2 ISS Loader

The ISS loader runs only at start-up or when changing to a new geographic area of interest. It retrieves the data associated with a particular area from the Mass Storage System (MSS) and loads (writes) that data to the disks associated with the ISS servers.

The loader determines which data set needs to be loaded: this can be done either through a request from an application, or via a user request at the command line. The loader then finds the locations of the necessary data on the MSS, and uses a tile placement algorithm to determine how the data should be distributed over the available disks and servers. The loader also launches one scribe process for each ISS disk. The loader then enters a cycle of reading data from the MSS, determining where it should be stored, and sends the data to the appropriate scribe, which writes the data to disk.

The tile placement algorithm creates a fast lookup table that the loader saves to a text file. The ISS master in turn calls on a routine to convert the contents of the file back into the fast lookup table so that the tile placement algorithm need only be run once for each configuration of ISS servers and disks.

6.3 Scribe

The ISS scribe process receives tile data from the ISS loader and writes the data onto a specific disk at a specific location given by the tile placement algorithm. The offset of each tile is encoded with the data and transferred to the scribe by the loader. There is one scribe per disk on each host: this allows simultaneous parallel disk writing and thus reduces the time needed to load a data set.

6.4 MSS

The data that is loaded onto the ISS usually comes from a Mass Storage System. The Mass Storage System (MSS) is a hardware/software system for storing large amounts of data. Most of our experience has been with the unitree product, which is an implementation of the proposed IEEE Mass Storage System. The most prominent and useful features of the Unitree MSS are its UNIX interface, the permanent visibility of stored files in what appears as an "infinite" UNIX file system, and the capability of file retrieval without human intervention. That is, users can access the MSS as an NFS mounted file system or through FTP, using normal UNIX and FTP commands, and all files remain visible and accessible to the user indefinitely.

The software component of the MSS is the UniTree Central File Manager (UCFM). UCFM is a collection of processes, called servers, running on a UNIX machine called the storage server. UCFM can also be run as a distributed system. Currently the MSS manages two layers of storage media. At the top is a magnetic disk cache; this is where files enter the system. Files migrate (are copied) to a lower archival layer (tape) soon after they arrive on the disk cache. After files have migrated, they may be purged (removed) from the disk as the cache fills. Files are cached (copied) from tape to the disk when users access them if the files have previously been purged from disk. Files are directly accessible from the cache by NFS (UNIX file system commands) and FTP. When files are modified, they are again migrated to tape, and old versions become inaccessible.

For increased data security the MSS creates multiple tape copies of files as they are migrated. Note that this is not equivalent to conventional backups which may create redundant copies of files over time, and which may permit the recovery of previous, obsolete, versions of files.

6.5 GPS Receivers and NTP

The Global Positioning System (GPS) Receivers and the Network Time Protocol (NTP) synchronize the clocks of all systems to enable accurate monitoring of all traffic across the ATM network. The GPS is a satellite-based radio navigation system that provides precise, continuous, all-weather, world-wide navigation capability for sea, land, and air applications. Twenty-four GPS satellites (21 primary and 3 spares) orbit the earth about once every 12 hours, and are positioned in orbit such that at least four satellites are always visible simultaneously to a GPS receiver at any location on or above the earth. Each satellite carries an atomic clock that is referenced to the NBS primary clock. The number of visible satellites is important because four satellites are needed for a GPS receiver to calculate position and/or time. We are using Magnavox MX 4200 receivers connected to SPARCStations through the serial port.

The approach used by NTP to achieve reliable time synchronization for a set of possibly unreliable remote time servers is as follows: Each server attempts to synchronize to UTC (Universal Coordinated Time) using the best available source and available transmission paths to that source. A group of NTP- synchronized clocks may be close to each other in time, but this is not a consequence of the clocks in the group having synchronized to each other, but rather because each clock has synchronized closely to UTC. NTP operates on the premise that there is one true standard time, and that if several servers which claim synchronization to standard time disagree about what that time is, then one or more of them must be broken. There is no attempt to resolve differences more gracefully since the premise is that substantial differences cannot exist. In essence, NTP expects that the time being distributed from the root of the synchronization subnet will be derived from some external source of UTC (e.g., a radio clock or GPS receiver).

Time is distributed through a hierarchy of NTP servers, with each server adopting a "stratum" which indicates relative distance from an external source of UTC. Stratum-1 servers, which are at the top of the pile (or bottom, depending on your point of view), have access to some external time source, usually a radio clock synchronized to time signal broadcasts from radio stations which explicitly provide a standard time service. A stratum-2 server is one which is currently obtaining time from a stratum-1 server, a stratum-3 server gets its time from a stratum-2 server, and so on. Normally, when all servers are in agreement, NTP will choose the best of these, where "best" is defined in terms of lowest stratum, closest (in terms of network delay) and claimed precision, along with several other considerations.

The ISS uses the xntpd program[18], which is an implementation of the NTP Version 3 specification, as defined in RFC 1305. In general, xntpd’s precision is limited by that of the onboard time-of-day clock maintained by the hardware and operating system, while its stability is limited only by that of the onboard frequency source, usually an uncompensated crystal oscillator in a workstation. On modern RISC-based processors connected directly to radio clocks via serial-asynchronous interfaces, the accuracy is usually limited by that of the radio clock and interface to the order of a few microseconds.

7.0 Experience and Results

Since one of the functions of the ISS is to also act as a monitoring tool for examining the flow of data across a gigabit network testbed, many performance monitoring and feedback routines have been incorporated into its design. These include routines to check the latency at all the various stages of data transfer through the ISS including reading data from the network, writing data to the network, travelling across a segment of the network, finding the data on the different disks that are attached to different hosts, and reading data from those disks.

In Figure 13, "Timestamps in the ISS," on page 71 we see a diagram of the various points in the ISS where the different timestamps are recorded. They are numbered sequentially as to the order in which they would be stamped in an actual request as it traverses the ISS. There is the problem of the fact that different machines in the ISS may have different concepts of the time, so the NTP time protocol described above is used to keep the timers on all the host machines in the ISS in sync

.

7.1 Theoretical Performance Limits

In order to analyze the performance of the ISS software, we first need to examine the theoretical limits of all the hardware components that make up the ISS. The numbers that are described below represent the specifications from the manufacturer and also the actual measurements obtained using 49152 byte tiles (the size currently used by TerraVision) on a Sun SparcStation 10-41. Using these numbers we can obtain a baseline maximum speed that the ISS should be able to operate at.

The first piece of hardware examined was the disk. Seagate Barracuda Disks have a manufactures specification of 7 Mbytes/sec sustained transfer rate. Measurement in a real environment gave a speed of 2.6 MBytes/sec. The reason for this is the size of the data that is being read off of the disks, if the read size is increased to the optimal size (128K in the case of these disks) one can obtain speeds closer to the manufactures claimed specifications (4.7 MBytes/sec) but never what was claimed. What is measured includes a number of OS factors, as well as disk performance. This is true of all the "hardware measurements" here, and the manufactures should not be blamed. It is just the case (assuming that the driver is operating correctly which is not always true) that integration into a real system imposes some overhead.

Fast SCSI host adaptors, have a rated throughput of 10 MBytes/sec. What is typically achieved during tests with one Seagate disk through Sun OS 4.1 is 3.5 MBytes/sec. Using 2 disks on one host adaptor the measured transfer speed was 5 MBytes/sec. If two host adaptors were used on the same machine with two disks on each of the host adaptors, an aggregate transfer rate of 9Mbytes/sec was obtained.

Some other limits that were examined during the platform evaluation was the Sbus (main backbone of a Sparc Station 10). The specifications for the Sbus (and a 75MHz CPU) are a transfer rate of 40 MBytes/sec. The CPU to RAM interconnect (also called the MBus) has a max transfer rate of 105 MBytes/sec. Our measurement of the user space to user space memory copy (memcpy) was only able to achieve 22 MBytes/sec. The maximum rate at which we could write data to the FDDI network interface was 45 MBits/sec using UDP.

Using two Fast-SCSI host adaptors and four disks, and reading random 49 Kbyte tiles from all disks simultaneously, the measured total disk throughput to user memory was 9 Mbytes per second. Adding a simple process which sends UDP packets to the FDDI simultaneously with disk access drops the aggregate disk throughput to 8 MBytes/sec. The network throughput under these conditions is 34 MBits/sec. Under these conditions the total throughput of data moving across the SBus is 12.25 MB/sec. This seems to be the maximum throughput for the for the SparcStation 10-41 with 4 disks, 2 SCSI host adaptors, and 1 FDDI card. This throughput then seems to be a good goal for the ISS since this number is just data flowing across all the buses in the system simultaneously and does not represent coordinated coherent direction of the data as is required in the ISS. This number is obtained from examining Figure 14, ""Typical 2+nd Generation" Workstation (circa 1993)," on page 74. See also the section on Bottlenecks. This limit does not include the ISS overhead of buffer management, semaphore locks, context switching, and other issues described in the previous section. The SCSI host adaptor and SBus are not yet saturated, but adding more disks will not help the overall throughput without a faster network interface.

7.2 Performance Monitoring

One of the design goals of the ISS was to be able to use it to monitor the network that connected the disk servers and application. To accomplish this timing code was inserted into every important part of the ISS to measure the latency, speed and bandwidth. Some of the more important measurements were the different latencies of each segment of the network, the time involved in locating the data, the time for each individual request to read off of disk, and the time it takes to send a request through the TCP/IP protocol stack in the OS.

The reason form the interested in these performance measurements was a matter of optimizing the overall performance of the ISS and interest in characterizing the network that the ISS was running over, in terms of the various latencies, throughput, and how it responds to different levels of traffic, from steady to bursty.

7.3 Actual Performance

The current throughput of a single ISS server on a Sun SPARC 10/41 platform is 7.1 Mbytes/s (55 Mbits/sec) or 91% of the possible maximum of 7.5 Mbytes/s (60 Mbits/s) derived above. This seems a reasonable result considering the additional functionality of the real ISS platform. We have achieved this speed using a TerraVision-like application simulator described above. The 55 MBits/sec is achieved with the simulator sending a list of requests for data at a rate of five request lists per second. Five request lists per second does not force the application to predict and buffer too far into the future, and is not so fast that disk read latency is an issue. The request lists are long enough to ensure that no disk is ever idle. When the ISS receives a request list, all previous requests are discarded. Under these conditions, about one-half of the requests in each request list will never be satisfied either they will be read into the cache but not written to the network, or they will not be read at all before the next request list arrives.

As an example of the operation of the "predict more tiles then can be sent" strategy, consider the following. A typical TerraVision request list contains fifty tiles. Of these fifty tiles forty are read into ISS cache, twenty-five are written to the network, and ten are not processed at all (because the next list arrives). This behavior is reasonable because, as discussed in the section on data path prediction above, the application will keep asking for data until it shows up or is no longer needed. The requesting application will anticipate this behavior, and predict the tiles it needs far enough ahead that "important" tiles are always received by the time they are needed. Requesting more tiles then the ISS can provide before another request is sent allows partial pre-processing of low priority requests if, for example, the sender is idle waiting for a disk seek for high priority data. Tiles are kept in the cache on an LRU basis, and previously requested but unsent tiles will be found in the cache by a subsequent request. The overhead of re-requesting tiles is minimal compared with moving them from disk and sending them over the network.

During ISS operation, the average CPU usage on the disk server platform is 10% user, 60% system, and 30% idle, so the CPU is not a bottleneck. Servicing the TerraVision application with 40 Mbytes of disk cache memory on the ISS server, an average 2% of requested tiles are already in cache. Increasing the cache size will not increase the throughput, but may improve latency with effective path prediction by the application.

7.4 Bottlenecks

The main bottleneck for the application is the speed of moving data in and out of memory. A SparcStation 10 uses 70ns SIMMs (RAM chips), which means that memory copy speed is limited to about 22 MB/s. When writing to the network, the situation is even worse because data are moved to the interface via UNIX "mbufs" [16], adding additional overhead. We have measured the speed of an mbuf copy a 15MBytes/sec, and there are 2 mbuf copies required to write a packet to the network. Along with the other overhead required to assemble packets, this limits the speed with which the interface can write to the network to 4.25 MB/s.

If the network sends were faster, i.e., 19.4 Mbytes/s (155 Mbits/s - the OC-3 rate, ignoring ATM overhead), the next bottleneck would be the disk reading speed, which in this configuration is 9 Mbytes/s (72 Mbits/s). This bottleneck is trivially removed by adding more disks. This brings us back the "memcpy" limit of 22 Mbytes/s as the next bottleneck. The other bottlenecks are not likely to be relevant in the near future. Increasing the speed of workstation memory is the key to increased performance for this application.

7.5 Memory Copy Speed

Since the main bottleneck appears to be memory copy speed, we performed some tests on several high-end workstations including some newer workstations that use interleaved memory. Figure 16, "Memory Speed," on page 78 shows our results.The following systems were tested

: Sun SPARCStation 10/41 (one processor), Sun SPARCserver-1000 (six processors), a DEC Alpha 3000/400 (one processor), an SGI Challenge L (two processors), and an SGI Onyx (four processors).

The first results indicated poor memory copy bandwidth relative to the hardware potential of the memory subsystem for all of the workstations that were considered. Subsequent testing on multiprocessor systems, Figure 16, "Memory Speed," on page 78, showed that the problem apparently lies in the OS or memory controller, because each CPU can get almost the same memory bandwidth simultaneously up to the memory subsystem performance level. In the multiprocessor machines where a single CPU could not saturate the memory subsystem (true for both multiprocessor machines that we tested), the addition of more disks and multiple network adaptors operated by different CPUs should result in linear speedup, up to the memory subsystem bandwidth.

For a description of factors that affect high-speed network I/O, including memory copy speed, see Steenkiste[27].

7.6 Expected Future Performance

Using next generation workstations, most of these bottlenecks are considerably diminished. The most important improvement is that of interleaved memory. For example, a Sun SparcServer 1000 has 2-way interleaved memory, up to 4 Sbuses at 50MBytes/sec and a 250 MByte/sec interconnect. The SGI Challenge L has 8-way interleaved memory, a 320 MB/s HIO bus, and a 1.2 GB/s interconnect system bus. These systems also can be configured with up to 12 processors, one of which could be dedicated to the network write process. Using one of these systems should improve ISS performance considerably. One can see the memory behavior of the Challenge System in Figure 17, "Memory Behavior," on page 80

and now knowing the direct relation between memory copy speed and network bandwidth, it is hoped that the ISS will be able to achieve much greater speeds.

7.7 The Results of Caching

The results of caching data from previous requests show a slight improvement over non-cached requests. The reason for this slight improvemnet is that no application at this time uses the ISS’ ablility to cache data.

Most of the testing so far has shown that only about 2% of the future requests have been in cache, and of these, most were only in cache because they were a re-request for data. The time difference for getting data that is in the cache is tremendous. The average response time for an ISS server that has the request in cache is 3 ms, whereas the time for a request that has to read off disk is usually greater then 41ms, one would hope that the prediction facilities will be used more often in the future.

7.8 TCP/IP Performance

TCP speeds are bounded by the window size divided by the round trip time. The TCP window is the amount of buffer space available on the receiver end of a TCP connection. The larger the buffer space, the more packets the receiver can accept before the host has to process them or tell the sending application to slow down. The buffer size also affects the number of packets that can be outstanding, or "in the pipe" [11]. We have found that with long distance ATM networks, a large TCP window is extremely important, as is expected for a high-bandwidth, large-delay network.

Table 1 shows TCP speeds vs. TCP window size as measured using ttcp in an ATM LAN and ATM WAN environment. This table clearly shows the importance of the TCP window size with ATM networks, especially in the WAN environment when some other factor is not the limit. Using the default TCP window sizes of 24 KBytes (Sun) or 32 KBytes (DEC and SGI), an ATM-based application would only see Ethernet-like speeds.

TCP speed over ATM

Window size

16K

24K

32K

64K

96K

128K

192K

256K

LAN Sun to Sun (Mb/s)

30

34

54

*

*

*

*

*

LAN Alpha to Alpha

62

56

60

110

117

126

118

114

WAN Sun to Sun

11

12

27

37

46

47

47

48

WAN Alpha to Alpha

6.5

7.2

12.5

25

35.9

48.7

72.5

91.8

Note: all speeds for are 64K Byte transfers of data; * = data not available

Alpha to Alpha speeds are courtesy of Joseph Evans, University of Kansas, Lawrence, KS.

ATM interface for Sun (SS 10/41) is SBA-200 from FORE Systems, ATM for Alpha (DEC-3000/400) is the "Otto" card from DEC. ATM switch is from FORE Systems.

Sun to Sun: LAN RTT = 2 ms (through 1 ATM switch), WAN RTT = 8 ms (through 2 ATM switches).

Alpha to Alpha: LAN RTT = 1 ms (no switch), WAN RTT = 16 ms (through 2 ATM Switches).

7.9 Results of RTP vs. TCP

Table 2 shows a comparison of TCP, UDP, and RTP+ (timer driven RTP with back off). In general, UDP is about 30% faster than TCP, and RTP is about 15% slower than UDP.t should be noted that the ATM-WAN tests were done on an essentially empty network. It remains to be tested how the RTP+ protocol will perform on a loaded network where the factors of congestion, lost packets, etc. come into play.I

ATM speeds using TCP, UDP, and RTP

buffer size

1K Bytes

16K Bytes

32K Bytes

64K Bytes

network protocol

TCP

UDP

RTP

TCP

UDP

RTP

TCP

UDP

RTP

TCP

UDP

RTP

FDDI (Mbps)

41

42

35

47

76

44

49

71

46

47

*

47

ATM LAN (Mbps)

27

38

37

78

66

51

78

70

52

79

64

ATM WAN (Mbps)

25

35

47

80

71

50

79

53

50

79

59

Note: All test systems are Sun SS10-51; TCP window size = 48K for LAN and 192K for WAN; UDP speed measured at the receiver.

7.10 Results of Data Placement Algorithm

The data placement algorithm described in Section 5.3 "Tile Placement Algorithm" was compared to a round-robin approach to loading the data on the disks. Then the same request sequence was run on both. Surprisingly, the bottleneck of the networking interface added to the latency of disk reads and seek time is so much worse than the difference in the different algorithms that it was almost impossible to tell a difference between the two algorithms. In sorting through the request lists and the responses, we could see that there was a slight decrease in time for the requests stored with the round robin data placement algorithm.

7.11 Threads

Currently, the ISS server is implemented as a group of loosely-coordinated UNIX processes. We believe performance can be enhanced considerably by transforming these processes into threads. Most of the gains arise from bypassing the slowness of the interprocess communication mechanisms needed to guarantee consistency of resources shared by the processes, e.g., the semaphores needed to ensure non-simultaneous access to the to-read and to-write lists. The same functionality can be achieved using thread-based mechanisms that are designed to be much faster, e.g., mutual exclusion locks.

The ISS server requires separate processes to read from the network, read from disk, and write to the network. These processes must share certain resources, namely, the to-read lists, the to-write list, and the data cache. To ensure fair access to each of these resources, we force some processes to sleep for a short time: by this mechanism, we guarantee that the operating system will perform a context switch. When any ISS server process accesses either a list or the cache it first obtains a semaphore to guarantee exclusive access for the duration of the time it needs to perform its task. If other processes attempt to access the data, they are rejected and must, after a sleep-induced wait, try again.

Multiple threads guarantee exclusive access by using mutual-exclusion locks instead of the expensive semaphore mechanism. The overhead of mutex locks is much less than that of semaphores, and checking mutex locks is much faster. Threads which cannot obtain a needed resource enter into a state of conditional waiting: this state eliminates the cycle of checking for the available resources, sleeping, and checking again, which characterizes processes attempting to gain a shared resource. Threads in conditional wait are simply put to sleep and signaled when the resource is available. Interthread communication is much faster than interprocess communication and threads consume fewer resources, since threads share the same text space with one another.

7.12 Real Time

An application like the Image Server System could benefit from real time scheduling. The ISS currently must attempt to coerce the UNIX scheduler to context-switch between the various competing ISS processes: trying to promote such context switching wastes time and reduces efficiency. A real-time operating system allows fine-grained control of the scheduler by means of thread prioritization and conditional waiting. Effectively, threads can take more or less processor time as necessary instead of arbitrarily taking a fixed slice of CPU time, and reducing competition with kernel-level or other user-level threads. This ability to vary the amount of time used by each thread is especially useful given that the ISS is driven by external events (the requests for the images) and must deliver the images back to the driving application within a predetermined time.

7.13 ATM Networking Issues

The design of the ISS is based upon the ability to use ATM network switches to aggregate cells from multiple physical data streams into a single high-bandwidth stream to the application. Figure 18, "Data Streams Aggregated by ATM switches," on page 86 shows multiple ISS servers being used to form a single high-speed data stream to the application.

Below is a list of what we have learned from our experience using ATM networks. Most of the experience reflected here comes from our work in the MAGIC gigabit testbed[22].

Hardware and Physical Layer:

Link Layer:

Network:

Transport:

One of the things that is becoming apparent in our work with this architecture is that the conventional notion of QOS is not a good method for regulating tightly coupled applications like the ISS, and (for similar reasons) may not be good for distributed-parallel compute server systems. Problems frequently occur when several servers that normally operate asynchronously to provide data to a single source, suddenly synchronize to produce a burst of data that overloads the switch and interface on the single receiver, thereby causing everybody to slow down and retransmit, leading to severe throughput degradation. This is very similar to the problem of routing message synchronization described by Floyd and Jacobson[2].

8.0 Conclusions

The emergence of wide area high-speed networks enables many types of new systems, including distributed parallel data servers. We have designed and implemented a special purpose high-speed data server called the ISS. The ISS is designed to be distributed across multiple hosts with multiple disks on each host and should be capable of scaling to gigabit per second rates. Moreover, we believe that the core design is flexible enough so that only minor modifications need be made to adapt the ISS to different data types and access patterns.

In the process of implementing and using this system many things have been learned about workstation and operating system bottlenecks, and using ATM networks. In particular that memory to memory copy speed is the main I/O bottleneck on today’s workstations, and that ATM networks still have many problems to be worked out before they are ready for general use.

The main result has been to demonstrate an architecture that uses emerging, high-speed WAN’s to support the parallelism necessary to use multiple low cost workstations to provide a high performance I/O system.

References

[1] Doug, Adams, "Interleaved Memory", Usenet, 1994

[2] Floyd, S. and Jacobson, V., "The Synchronization of Periodic Messages", Proceedings of SIGCOMM ‘93, 1993.

[3] ANSI. High-Performance Parallel Interface- Encapsulation of IEEE 802.2 Logical Link Control Protocol Data Units (HIPPI-LE). X3t9/90-119 Draft Rev. 3.1

[4] Bach, M.J. 1986. The Design of the UNIX Operating System. Prentice-Hall, Englewood Cliffs, N.J.

[5] Cavanaugh, J.D, T.J. Salo, "Internetworking with ATM WANs" IEEE "Advances in Local and Metropolitan Area Networks" by William Stallings. 1993.

[6] Chen T. and Rotem D., "Declustering Objects for Visualization", Proc. of the 19th VLDB (Very Large Database) Conference, 1993.

[7] Comer, D.E., and Stevens, D. L. 1991 Internetworking with TCP/IP: Vol II: Design, Implementation, and Internals. Prentice-Hall, Englewood Cliffs, N.J.

[8] Furht, B., D. Grostick, D. Gluch, G. Rabbat, J. Parker and M. McRoberts. "REAL-TIME UNIX SYSTEMS Design and Application guide", Kluwer Academic Publishers.1991

[9] Ghandeharizadeh, S. and Ramos, L, "Continuous Retrieval of Multimedia Data Using Parallelism, IEEE Transactions on Knowledge and Data Engineering, Vol 5, No 4, August 1993.

[10] Hartman, J. H, John K. Ousterhout 1992. "Zebra: A Striped Network File System" UNENIX Association File System Workshop, pp 71-77

[11] Jacobson, V., Braden, R.T., and Borman D.A. "TCP Extensions for High Performance," RFC 1323, LBL, 1992.

[12] Johnston, W., Allen, A., M.D., "Regional Health Care Information Systems: Motivation, Architecture, and Implementation", Lawrence Berkeley Laboratory report no. 34770, Berkeley, CA, 94720.

[13] Kernighan, B.W., and Ritchie, D.M. 1984. The UNIX Programming Environment, Second Edition. Prentice-Hall, Englewood Cliffs, N.J.

[14] Langberg, M., "Silicon Graphics Lands Cable Deal with Time Warner Inc.", San Jose Mercury News, June 8, 1993.

[15] Leclerc, Y.G. and Lau, S.Q., Jr.,"TerraVision: A Terrain Visualization System", SRI International, Technical Note #540, Menlo Park, CA, 1994.

[16] Leffler, S.J., McKusick, M.K., and Quaterman, J.S. "The Design and Implementation of the 4.3BSD UNIX Operating System", Addison-Wesley, Reading, Mass., 1989.

[17] Malamud, C., "Stacks: Interoperability in Today’s Computer Networks", Prentice-Hall, Englewood Cliffs, NJ, 1992.

[18] Mills, D., "Simple Network Time Protocol (SNTP)", RFC 1361, University of Delaware, August 1992.

[19] Mills, D., "Network Time Protocol (Version 3) specification, implementation and analysis", RFC 1305, University of Delaware, March 1992.

[20] Partridge, C. 1994. Gigabit Networking. Addison-Wesley, Reading, Mass.

[21] Patterson D., Gibson G., and Katz R., "A Case for Redundant Arrays of Inexpensive Disks (RAID)," in Proc. 1988 SIGMOD Conf., June 1988.

[22] Richer, I and Fuller, B.B, "An Overview of the MAGIC Project," M93B0000173, The MITRE Corp., Bedford, MA, 1 Dec. 1993.

[23] Rowe L. and Smith B.C, "A Continuous Media Player", Proc. 3rd International Workshop on Network and Operating System Support for Digital Audio and Video, San Diego, CA, Nov. 1992.

[24] Rowe L., "Video Compression", Invited Talk, Usenix Conf, San Francisco, 1994.

[25] Schulzrinne, H. and Casner S., "RTP: A Real-Time Transport Protocol", Internet Draft, Audio/Video Transport Working Group of the IETF, 1993.

[26] Skibo, T. M. 1992. Using the High-Performance Parallel Interface for Network Traffic, Master Thesis, University Illinois at Urbana-Champaign.

[27] Steenkiste, P.A., "A Systematic Approach to Host Interface Design for High Speed Networks", IEEE Computer, Vol 27, No 3, March 1994.

[28] Stevens, W. R. 1990. UNIX Network Programming. Prentice-Hall, Englewood Cliffs, N.J.

[29] Stevens, W. R. 1992. Advanced Programming in the UNIX Environment. Prentice-Hall, Englewood Cliffs, N.J.

[30] Stevens, W. R. 1994.TCP/IP Illustrated, Volume 1, The Protocols. Prentice-Hall, Englewood Cliffs, N.J.

[31] Tanenbaum, A. S. 1989. Computer Networks, Second Edition. Prentice-Hall, Englewood Cliffs, N.J.

[32] Tierney, B., Johnston W., Herzog, H., Hoo, G., Jin, G., and Lee, J. R.,"System Issues In Implementing High Speed Distrubted Parallel Storage Systems,", Proceedings of the USNIX Symposium on High Speed Networking, Aug. 1994, LBL-35775.

[33] Tierney, B., Johnston, W. Chen, L. T., Herzog, H., Hoo, G., Jin, G., Lee, J. R., and Rotem, D., "Using High Speed Networks to Enable Distributed Parallel Image Sever Systems", Proceedings of Supercomputing ‘94, Nov. 1994, LBL-35437.

[34] Tierney, B., Johnston, W. Chen, L. T., Herzog, H., Hoo, G., Jin, G., Lee, J. R., and Rotem, D., "Using High Speed Networks to Enable Distributed Parallel Image Sever Systems", Proceedings of Supercomputing ‘94, Nov. 1994, LBL-35437

[35] Tolmie, D., and J. Renwick [1993]. "HIPPI: Simplicty Yieds Success," IEEE Network Magazine, Vol 7 No. 1, January 1993, pp. 28-33