Using control charts for monitoring of the Chebyshev supercomputer
Abstracts, XXII conference. Moscow, Sparrow hills, GSP-2. 1 pp. (accepted)
Parallel computations are increasingly widely applied to scientific problems. Supercomputer clusters work 24 hours a day, 7 days a week. Failure of one or several computational nodes may stop a computational process for hours or even days. Hence monitoring of multiprocessor computers becomes an important problem.
The supercomputer "Chebyshev", a part of the MSU supercomputer complex, is equipped with a monitoring system comprising detectors that periodically measure different supercomputer characteristics (average CPU load, percentage of running jobs, etc.). The time series produced by these detectors help to monitor the current state of the supercomputer. Technical malfunctions may be revealed by detecting anomalous segments of these time series.
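As an illustration of what "detecting anomalous segments" can mean in practice, here is a minimal sketch in Python (the report itself used R). It is one simple possibility and not necessarily the procedure applied to "Chebyshev". It assumes that the first `baseline_len` samples represent normal operation and flags runs of consecutive samples lying more than `k` baseline standard deviations from the baseline mean. The function name, the parameters, and the CPU load values are all hypothetical.

```python
from statistics import mean, stdev

def anomalous_segments(series, baseline_len, k=3.0):
    """Return inclusive (start, end) index pairs of consecutive samples that
    lie more than k baseline standard deviations from the baseline mean."""
    baseline = series[:baseline_len]   # assumed to reflect normal operation
    center = mean(baseline)
    sigma = stdev(baseline)
    segments = []
    start = None
    for i, x in enumerate(series):
        outside = abs(x - center) > k * sigma
        if outside and start is None:
            start = i                          # a new anomalous run begins
        elif not outside and start is not None:
            segments.append((start, i - 1))    # the run ended at the previous sample
            start = None
    if start is not None:                      # the run lasts until the end of the data
        segments.append((start, len(series) - 1))
    return segments

# Hypothetical average CPU load samples (percent); the drop imitates a node failure.
cpu_load = [70, 72, 69, 71, 70, 73, 68, 71, 70, 72, 15, 12, 14, 71, 70]
segments = anomalous_segments(cpu_load, baseline_len=10)
print(segments)
```

For this made-up series the three low samples at indices 10 to 12 are reported as a single segment.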
This report contains the results of an analysis of data obtained from "Chebyshev". Different types of control charts (Western Electric rules, Shewhart, EWMA and CUSUM) were chosen as the main analytical tools. The analysis was performed using the R statistical environment. It was shown that control charts are quite effective for detecting abnormal behaviour of different supercomputer metrics.
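To make the difference between the chart types concrete, the following sketch implements textbook EWMA and tabular CUSUM charts in Python. It is not the R code used for the report. It assumes the in-control mean `mu0` and standard deviation `sigma` are known. The parameter values (`lam=0.2`, `L=3`, `k=0.5`, `h=5`) are common textbook choices, not values taken from the report, and the data are invented.

```python
from math import sqrt

def ewma_alarms(series, mu0, sigma, lam=0.2, L=3.0):
    """Indices where the EWMA statistic leaves its time-varying control limits."""
    alarms = []
    z = mu0  # the EWMA statistic starts at the in-control mean
    for t, x in enumerate(series, start=1):
        z = lam * x + (1 - lam) * z
        half_width = L * sigma * sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        if abs(z - mu0) > half_width:
            alarms.append(t - 1)
    return alarms

def cusum_alarms(series, mu0, sigma, k=0.5, h=5.0):
    """Indices where the upper or lower tabular CUSUM exceeds h * sigma."""
    alarms = []
    c_plus = c_minus = 0.0
    for i, x in enumerate(series):
        c_plus = max(0.0, x - (mu0 + k * sigma) + c_plus)     # accumulates upward drift
        c_minus = max(0.0, (mu0 - k * sigma) - x + c_minus)   # accumulates downward drift
        if c_plus > h * sigma or c_minus > h * sigma:
            alarms.append(i)
    return alarms

# Hypothetical metric with in-control mean 70 and sigma 2; from index 8 onward
# the level shifts upward by roughly 1.5 sigma.
metric = [70, 71, 69, 70, 72, 68, 70, 71, 73, 74, 73, 72, 74, 73, 74, 73]
mu0, sigma = 70.0, 2.0

shewhart = [i for i, x in enumerate(metric) if abs(x - mu0) > 3 * sigma]
ewma = ewma_alarms(metric, mu0, sigma)
cusum = cusum_alarms(metric, mu0, sigma)
print(shewhart, ewma, cusum)
```

On this series no single point leaves the Shewhart 3-sigma limits, while both the EWMA and the CUSUM chart first signal at index 12. This reflects the usual trade-off: Shewhart charts react to large isolated deviations, whereas EWMA and CUSUM accumulate evidence and are more sensitive to small sustained shifts. The sketch does not reset the statistics after an alarm, so they keep signalling while the shift persists.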
Literature
1. Gerhard Munz and Georg Carle. Application of forecasting techniques and control charts for traffic anomaly detection // In Proceedings of the 19th ITC Specialist Seminar on Network Usage and Traffic, Berlin, Germany, October 2008.
2. Celso Mendes and Daniel Reed. Monitoring large systems via statistical sampling // International Journal of High Performance Computing Applications, May 2004, 18, p. 267-277.
3. Douglas Montgomery. Introduction to statistical quality control // Wiley, 6th edition, 2008, 734 p.