Main Page
Anyone who has tried to debug or analyze the performance of a complex distributed application knows how difficult it can be. Problems and bottlenecks may lie in any of the software components, hardware components, networks, operating systems, and so on.
NetLogger is designed to make debugging and performance analysis of complex distributed applications easier. It has several different components:
- NetLogger Methodology: A methodology for analyzing distributed systems, summarized in our "Logging Best Practices" document. It includes a simple, common message format for all monitoring events, with high-precision timestamps.
- NetLogger storage and retrieval tools: Tools for distributed log collection, normalization, and insertion into a relational database.
- NetLogger visualization and analysis tools: Python and R programs for flexible visualization and analysis of NetLogger logs.
- NetLogger client API library: A C-language library (with Java and Python versions) that you add to your existing source code to generate monitoring events. A unique feature of this library is the NetLogger summarization module (C language only), which performs real-time streaming summarization of high-throughput logging streams.
Documentation of these components (except the Logging Best Practices document linked to above) can be found in the NetLogger Manual, which is available from the wiki Documentation page.
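To make the idea of a simple, common message format with high-precision timestamps concrete, here is a minimal Python sketch that formats one monitoring event as a line of name=value pairs. It does not use the NetLogger client API: the function name `format_event` and the field names `ts`, `event`, and `level` are assumptions made for illustration, so consult the Logging Best Practices document for the actual format.

```python
from datetime import datetime, timezone


def format_event(event, ts=None, level="Info", **fields):
    """Format one monitoring event as a single line of name=value pairs.

    Hypothetical helper for illustration; not part of the NetLogger library.
    """
    if ts is None:
        ts = datetime.now(timezone.utc)
    # ISO 8601 timestamp in UTC with microsecond precision.
    stamp = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    parts = [f"ts={stamp}", f"event={event}", f"level={level}"]
    for name, value in fields.items():
        text = str(value)
        # Quote values that contain spaces so the line can still be split.
        if " " in text:
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{name}={text}")
    return " ".join(parts)


if __name__ == "__main__":
    print(format_event("transfer.start", block_id=7, host="node a"))
```

Keeping every event on one line with the same leading fields is what makes logs from different components easy to merge and sort by time.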
Sample results
Example: Lifelines
One type of visualization that we have found very useful is what we call the "lifeline". A lifeline is useful when data or control flows through several stages of a system and you want to track the time spent in, and between, each stage. In a lifeline plot, the X-axis represents wall-clock time and the Y-axis shows the discrete stages in the system, with the first stage at the bottom and the last stage at the top. Lifeline analysis is very intuitive, as the slope of the line indicates where the most time is being spent: the flatter a segment, the longer that step took.
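The lifeline construction described above can be sketched in a few lines of Python. This is an illustration only, not how the NetLogger tools are implemented: the event tuple layout `(object_id, stage, time)`, the stage names, and the function names are all hypothetical.

```python
from collections import defaultdict


def build_lifelines(events, stages):
    """Group (object_id, stage, time) events into one lifeline per object.

    Returns {object_id: [(time, stage_index), ...]}, sorted by time, where
    stage_index is the Y position of the stage (first stage = 0, at the bottom).
    """
    stage_index = {name: i for i, name in enumerate(stages)}
    lifelines = defaultdict(list)
    for object_id, stage, t in events:
        lifelines[object_id].append((t, stage_index[stage]))
    for points in lifelines.values():
        points.sort()
    return dict(lifelines)


def slowest_transition(points, stages):
    """Return (from_stage, to_stage, seconds) for the flattest segment."""
    best = None
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        elapsed = t1 - t0
        if best is None or elapsed > best[2]:
            best = (stages[s0], stages[s1], elapsed)
    return best


if __name__ == "__main__":
    # Hypothetical stage names and timestamps (seconds).
    stages = ["send", "recv", "write", "ack"]
    events = [
        ("b1", "send", 0.0), ("b1", "recv", 0.25),
        ("b1", "write", 1.0), ("b1", "ack", 1.25),
    ]
    lifelines = build_lifelines(events, stages)
    print(slowest_transition(lifelines["b1"], stages))
```

Each list of points can be handed to any plotting library as an X (time) and Y (stage index) series to draw one lifeline per object.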
The picture above is an example of NetLogger's powerful ability to correlate application and middleware instrumentation data with host and network monitoring data. This plot, generated by the NetLogger visualization tool, nlv, correlates client and server instrumentation data with CPU and TCP retransmission monitoring data. The events being monitored are shown on the y-axis, and time is on the x-axis. From bottom to top, one can see CPU utilization events, application events, and TCP retransmit events, all on the same graph. Each semi-vertical line represents the life of one block of data as it moves through the application. The gap in the middle of the graph, where only one set of header and data blocks is transferred in three seconds, correlates exactly with a set of TCP retransmit events. The plot makes it easy to see that the pause in the transfer is due to TCP retransmission errors on the network. Thus, we determine that our performance problems are due to network congestion. Tracking down the cause of the network congestion is a difficult problem and requires a large amount of network monitoring data.
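The same kind of correlation can be checked programmatically once both event streams share a common time base. The sketch below is not what nlv does internally; it is a hypothetical illustration that finds long pauses in one stream of timestamps and lists the events from a second stream (for example, TCP retransmits) that fall inside each pause. The function names and the gap threshold are assumptions.

```python
def find_gaps(times, min_gap):
    """Return (start, end) pairs where consecutive events are >= min_gap apart."""
    ordered = sorted(times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= min_gap]


def events_in_gaps(gaps, other_times):
    """Map each gap to the events from another stream that fall inside it."""
    return {
        gap: [t for t in other_times if gap[0] <= t <= gap[1]]
        for gap in gaps
    }


if __name__ == "__main__":
    # Hypothetical timestamps in seconds.
    app_times = [0.0, 0.5, 1.0, 4.0, 4.5]
    retransmit_times = [1.5, 2.0, 3.5]
    gaps = find_gaps(app_times, min_gap=2.0)
    print(events_in_gaps(gaps, retransmit_times))
```

Events that coincide with a pause are a strong hint rather than proof of cause, which is why viewing them together with the lifelines remains useful.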
Example: Workflow performance profiling
The above plot was generated from a NetLogger database created from the logs of a large distributed workflow (~1M events) for a computation by the CyberShake group of the Southern California Earthquake Center. It shows the variation in the running time of the main computational task of the workflow, across the 40 sub-workflows. The workflow was created by the Pegasus workflow management tool.
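A per-sub-workflow summary of this kind can be sketched as follows. The record layout `(subworkflow_id, start, end)` and the function name are hypothetical; the actual NetLogger database schema and the query behind the plot are not shown on this page.

```python
from collections import defaultdict
from statistics import mean


def runtime_by_subworkflow(records):
    """Summarize task running times per sub-workflow.

    records: iterable of (subworkflow_id, start, end), times in seconds.
    Returns {subworkflow_id: {"count", "min", "mean", "max"}}.
    """
    durations = defaultdict(list)
    for sub_id, start, end in records:
        durations[sub_id].append(end - start)
    return {
        sub_id: {
            "count": len(values),
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
        }
        for sub_id, values in sorted(durations.items())
    }


if __name__ == "__main__":
    # Hypothetical records for two sub-workflows.
    records = [(1, 0.0, 10.0), (1, 5.0, 25.0), (2, 0.0, 30.0)]
    for sub_id, stats in runtime_by_subworkflow(records).items():
        print(sub_id, stats)
```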
Links
Welcome to the new NetLogger Wiki. These pages are still under construction. The old NetLogger web site is still available at http://dsd.lbl.gov/NetLogger.old/
For time-ordered types of content like "current projects" and "new ideas", see the NetLogger Blog.


