This monitor tests the bandwidth and latency of node-to-node communication by running the MPI Ping-Pong: Lightweight Throughput diagnostic test (mpipingpong.exe), which is included in Windows HPC Server 2008 R2. By default, the monitor runs daily.
This monitor enters the Warning state if the MPI Ping-Pong diagnostic test fails on the cluster.
The Warning state is caused by either network congestion or network connectivity issues in the cluster. It indicates that at least one node is performing poorly relative to the other nodes in the cluster. A poorly performing node meets both of the following criteria:
The node's average latency or throughput over all of its network links is at least one standard deviation away from the mean value for the cluster, and
The node's latency is at least 20% higher, or its throughput is at least 20% lower, than the cluster mean. This second criterion avoids unwarranted warnings on highly uniform cluster networks.
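The two criteria above can be sketched in code as follows. This is an illustrative sketch only, not the diagnostic test's actual implementation. The function name, the input dictionaries, and the units are hypothetical. It also assumes that both criteria are applied to the same metric and that the population standard deviation is used; the text above does not specify either point.

```python
from statistics import mean, pstdev


def poorly_performing_nodes(latency, throughput, threshold=0.20):
    """Return the sorted names of nodes that meet both criteria.

    latency and throughput map a node name to that node's average over
    all of its network links (hypothetical input format).
    """

    def outliers(values, worse_is_higher):
        mu = mean(values.values())
        # Assumption: population standard deviation across all nodes.
        sigma = pstdev(values.values())
        flagged = set()
        for node, value in values.items():
            # Criterion 1: at least one standard deviation from the mean.
            far_from_mean = abs(value - mu) >= sigma
            # Criterion 2: at least 20% worse than the cluster mean.
            if worse_is_higher:
                much_worse = value >= mu * (1 + threshold)
            else:
                much_worse = value <= mu * (1 - threshold)
            if far_from_mean and much_worse:
                flagged.add(node)
        return flagged

    slow = outliers(latency, worse_is_higher=True)
    starved = outliers(throughput, worse_is_higher=False)
    return sorted(slow | starved)


# Made-up sample data: node n4 has twice the latency of its peers.
sample_latency = {"n1": 10.0, "n2": 10.0, "n3": 10.0, "n4": 20.0}
sample_throughput = {"n1": 100.0, "n2": 100.0, "n3": 100.0, "n4": 100.0}
print(poorly_performing_nodes(sample_latency, sample_throughput))  # prints ['n4']
```

On a highly uniform cluster, a node can be more than one standard deviation from the mean while differing from it by only a few percent. In this sketch the 20% check keeps such a node from being flagged.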
To troubleshoot and fix this problem:
Check the network connectivity for all the nodes in the cluster.
Review the diagnostic test results in HPC Cluster Manager (in Diagnostics) for detailed information about the failure.