Quarterly Report for the Distributed Monitoring Framework (DMF) project, April 2002
Progress: Oct 2001 to April 2002
The focus of the DMF project for FY02 was to lead efforts in the HEP and GGF communities to define requirements for a Grid monitoring system, to finish a prototype GMA implementation, to improve the performance and fault tolerance of NetLogger, and to design and implement a prototype monitoring event archive. More details on each of these topics follow.
In collaboration with the Global Grid Forum (GGF), we finished the specification of the "Grid Monitoring Architecture" (GMA). We also completed a prototype implementation of the GMA, based on SOAP and Python, called pyGMA. We continued to work with the GGF to define standard event names, attributes, and schemas for Grid monitoring. We also worked with Keith Jackson on a Python-based GSI extension to SOAP, and will soon integrate this into pyGMA.
Brian Tierney is co-leading the PPDG monitoring group (http://www-unix.mcs.anl.gov/~schopf/pg-monitoring/). This helped us to better define requirements for the DMF. This group is working on a requirements document, the current draft of which is at http://www-unix.mcs.anl.gov/~schopf/pg-monitoring/ReqDoc/monitoring_requirments.v1.pdf.
This group will work toward the use of the unified schemas currently being defined by the Glue-Schema group (discussed below), and toward the development of the sensors or information providers needed to allow interoperable deployment of this information.
Brian Tierney is also co-leading the "Glue Schema Work Group", which is tasked with defining common schemas for interoperability between the EU physics Grid projects (focusing on EDG and DataTag) and the US physics Grid projects (focusing on PPDG, GriPhyN and iVDGL). The web page for this project is http://www.hicb.org/glue/glue-schema/schema.htm. This work is part of the Grid Laboratory Uniform Environment (GLUE) Phase I task (http://www.hicb.org/glue/GLUE-v0.04.doc).
The first step is to define common schemas to describe Compute Elements (CE), Storage Elements (SE), and Network Elements (NE), to be used by the MDS and R-GMA Grid Information Services. The goal is to have common schemas defined, deployed, and tested in time for the EU DataGrid Testbed 2 release in September 2002. Common schemas for monitoring and notification events are being addressed by the Global Grid Forum DAMED working group, and will be addressed later by iVDGL and DataTag. Good progress is currently being made on defining a common schema for the "Compute Element". For more information see the PPDG quarterly report.
We designed and implemented a prototype event archive for monitoring data, based on an open source relational database (MySQL). We defined schemas for NetLogger events, enabling the archive to be updated and queried efficiently. We also developed an "archive feeder" for NetLogger that reads events from the network, buffers them on disk, and then feeds them in batches to the database. Batching makes the database inserts more efficient, and the disk buffer keeps bursts of incoming event data from overloading the database and turning it into a bottleneck. The archive also supports the GMA consumer and producer interfaces. We also developed a web-based interface to the event archive.
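The buffer-then-batch pattern used by the archive feeder can be illustrated with a minimal sketch. This is not the actual feeder: the class name ArchiveFeeder, the JSON-lines spool file, the three-column event table, and the use of Python's built-in sqlite3 module in place of MySQL are all assumptions made so that the example is self-contained.

```python
import json
import os
import sqlite3
import tempfile


class ArchiveFeeder:
    """Hypothetical feeder: spool events to disk, then load them in batches."""

    def __init__(self, db, spool_path, batch_size=100):
        self.db = db                  # sqlite3 connection (stand-in for MySQL)
        self.spool_path = spool_path  # on-disk buffer, one JSON event per line
        self.batch_size = batch_size
        # Illustrative schema only; the real NetLogger schemas are not shown here.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS event (ts REAL, host TEXT, name TEXT)"
        )

    def receive(self, event):
        """Append one incoming event to the spool file."""
        with open(self.spool_path, "a") as spool:
            spool.write(json.dumps(event) + "\n")

    def flush(self):
        """Insert all spooled events, one transaction per batch."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path) as spool:
            rows = [
                (e["ts"], e["host"], e["name"])
                for e in (json.loads(line) for line in spool if line.strip())
            ]
        for start in range(0, len(rows), self.batch_size):
            batch = rows[start:start + self.batch_size]
            with self.db:  # commits the whole batch as a single transaction
                self.db.executemany("INSERT INTO event VALUES (?, ?, ?)", batch)
        os.remove(self.spool_path)
        return len(rows)


def demo():
    db = sqlite3.connect(":memory:")
    spool = os.path.join(tempfile.mkdtemp(), "events.spool")
    feeder = ArchiveFeeder(db, spool, batch_size=2)
    for i in range(5):
        feeder.receive({"ts": float(i), "host": "hostA", "name": "disk.read.start"})
    loaded = feeder.flush()
    stored = db.execute("SELECT COUNT(*) FROM event").fetchone()[0]
    return loaded, stored
```

In this sketch the database sees a few multi-row transactions rather than one insert per event, and a burst of incoming events only grows the spool file until the next flush.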
We have made many changes and improvements to NetLogger. First, we added a new, very efficient binary logging format. It supplements the existing ASCII-based log format and is over 4 times faster. Details are in a paper submitted to the 2002 High Performance Distributed Computing Conference and are available as an LBNL report. We also added a fault tolerance mechanism: if NetLogger is sending events to a remote host and encounters a problem, the events are saved to local disk instead, and NetLogger automatically tries to reconnect to the remote host periodically.
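The fault tolerance behavior described above (fall back to local disk, retry the remote host periodically) can be sketched as follows. This is an illustration under stated assumptions, not the NetLogger API: the FailoverLogger class, the connect callable, the retry_interval parameter, and the injectable clock are all hypothetical names introduced for this example.

```python
import os
import tempfile
import time


class FailoverLogger:
    """Hypothetical writer: prefer the remote host, fall back to local disk."""

    def __init__(self, connect, fallback_path, retry_interval=5.0,
                 clock=time.monotonic):
        self.connect = connect              # returns an object with send(); raises OSError
        self.fallback_path = fallback_path  # local file used while the remote is down
        self.retry_interval = retry_interval
        self.clock = clock
        self.remote = None
        self.next_retry = 0.0

    def _try_connect(self):
        now = self.clock()
        if now < self.next_retry:           # too soon since the last failure
            return
        try:
            self.remote = self.connect()
        except OSError:
            self.remote = None
            self.next_retry = now + self.retry_interval

    def write(self, line):
        """Send one event; return where it went ("remote" or "disk")."""
        if self.remote is None:
            self._try_connect()
        if self.remote is not None:
            try:
                self.remote.send(line)
                return "remote"
            except OSError:
                self.remote = None
                self.next_retry = self.clock() + self.retry_interval
        with open(self.fallback_path, "a") as f:
            f.write(line + "\n")
        return "disk"


class FlakyRemote:
    """Test double standing in for a remote log host."""

    def __init__(self, state):
        self.state = state
        self.received = []

    def send(self, line):
        if not self.state["up"]:
            raise OSError("remote host unreachable")
        self.received.append(line)


def demo():
    state, now = {"up": True}, [0.0]
    remote = FlakyRemote(state)

    def connect():
        if not state["up"]:
            raise OSError("connection refused")
        return remote

    path = os.path.join(tempfile.mkdtemp(), "fallback.log")
    log = FailoverLogger(connect, path, retry_interval=5.0, clock=lambda: now[0])
    results = [log.write("event 1")]
    state["up"] = False
    results.append(log.write("event 2"))   # send fails, event goes to disk
    now[0] = 1.0
    results.append(log.write("event 3"))   # retry interval not yet elapsed
    state["up"], now[0] = True, 6.0
    results.append(log.write("event 4"))   # reconnect succeeds
    return results, remote.received
```

The sketch only covers saving events locally and reconnecting; whether and how spooled events are later forwarded to the remote host is not specified here and is left out.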
We are working on adding a "trigger interface" to NetLogger, and wrote a GMA-based "activation service" to set the trigger. This works as follows: a consumer sends a request to an activation service to start monitoring a particular type of event. The activation service creates an entry in a "trigger file". The application, via NetLogger library calls, periodically checks for updated trigger files, and starts logging the specified events. The activation service buffers, filters, and forwards the requested event data back to the consumer.
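The trigger mechanism described above can be sketched in a few lines. The trigger file format (one event name per line), the activate function, and the TriggeredLogger class are assumptions made for illustration; they are not the NetLogger library calls or the activation service's actual interface.

```python
import os
import tempfile


def activate(trigger_path, event_name):
    """Activation-service side (hypothetical): add an entry to the trigger file."""
    with open(trigger_path, "a") as f:
        f.write(event_name + "\n")


class TriggeredLogger:
    """Application side (hypothetical): log only events named in the trigger file."""

    def __init__(self, trigger_path):
        self.trigger_path = trigger_path
        self.enabled = set()
        self._last_stamp = None
        self.logged = []

    def check_trigger(self):
        """Reload the trigger file if it has changed since the last check."""
        try:
            st = os.stat(self.trigger_path)
        except FileNotFoundError:
            return
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp == self._last_stamp:
            return
        self._last_stamp = stamp
        with open(self.trigger_path) as f:
            self.enabled = {line.strip() for line in f if line.strip()}

    def write(self, event_name, **fields):
        """Record the event only if monitoring for it has been activated."""
        if event_name in self.enabled:
            self.logged.append((event_name, fields))
            return True
        return False


def demo():
    path = os.path.join(tempfile.mkdtemp(), "trigger")
    log = TriggeredLogger(path)
    log.check_trigger()
    before = log.write("io.read", nbytes=4096)   # not yet activated
    activate(path, "io.read")
    log.check_trigger()
    after = log.write("io.read", nbytes=4096)    # now activated
    other = log.write("io.write", nbytes=1)      # never requested
    return before, after, other
```

In a real application check_trigger would be called periodically, for example on a timer or every N log calls, so that the cost of the file check stays small relative to the application's own work. The buffering, filtering, and forwarding done by the activation service are not shown.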
We were invited to give numerous talks and tutorials on various aspects of our work. These included several NetLogger tutorials, and talks at a Python conference, the Internet 2 End-to-End Monitoring Workshop, and the ESNet ESCC meeting.
We have been extremely active in the Global Grid Forum. Dan Gunter is co-chair of the GMA Working Group, and Brian Tierney is co-chair of the Network Measurements Working Group. We are also involved with several other groups, including the Event Schema, Remote IO, Information Services, Network Research, Architecture, and Event Notification groups.
We have also been collaborating with several groups, including NLANR, EU DataGrid, Globus, and the IEPM project at SLAC on the possible use of NetLogger to collect monitoring data for their projects.
We have also been working closely with the Globus project to define instrumentation and monitoring services for Globus and the new "Open Grid Services Architecture" (OGSA), based on NetLogger.
We worked with several groups to help add NetLogger instrumentation to their software. These include the D0/SAM software at FNAL, Andy Hanushevsky's bbcp (SLAC), and software from the Data Management Group at CERN.