

XXII conference

Using control charts for monitoring of Chebyshev supercomputer

Nikolsky I.M.

Moscow, Sparrow hills, GSP-2

1 pp. (accepted)

Parallel computing is increasingly widely used to solve scientific problems.
Supercomputer clusters run 24 hours a day, 7 days a week. The failure of one
or several computational nodes may halt a computation for hours or even days.
Hence, monitoring a multiprocessor computer becomes an important problem.

The "Chebyshev" supercomputer, part of the MSU supercomputer complex, is
equipped with a monitoring system comprising detectors that periodically
measure various characteristics of the supercomputer (average CPU load,
percentage of running jobs, etc.). The time series produced by these detectors
help to monitor the current state of the supercomputer. Technical malfunctions
can be revealed by detecting anomalous segments in these time series.

This report presents the results of an analysis of data obtained from
"Chebyshev". Different types of control charts (Western Electric rules,
Shewhart, EWMA, and CUSUM) were chosen as the main analytical tools.
The analysis was performed in the R statistical environment.
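As a rough illustration of the chart types listed above, the sketch below
applies two of the Western Electric rules on top of Shewhart-style limits. It
is written in Python rather than R, and the function names, the choice of
rules, and the use of a fault-free baseline window are assumptions made for
illustration, not the procedure used in the report.

```python
from statistics import mean, stdev


def estimate_baseline(baseline):
    """Estimate the center line and sigma from data assumed to be in control."""
    return mean(baseline), stdev(baseline)


def western_electric_flags(series, center, sigma):
    """Return sorted indices of points that violate either of two rules.

    Rule 1: a single point more than 3 sigma away from the center line.
    Rule 4: eight consecutive points on the same side of the center line.
    (Hypothetical helper; the 2-of-3 and 4-of-5 zone rules are omitted.)
    """
    flagged = set()
    run_side, run_len = 0, 0
    for i, x in enumerate(series):
        # Rule 1: point outside the 3-sigma Shewhart limits.
        if abs(x - center) > 3 * sigma:
            flagged.add(i)
        # Rule 4: track the length of the current one-sided run.
        side = (x > center) - (x < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == run_side:
            run_len += 1
        else:
            run_side, run_len = side, (1 if side != 0 else 0)
        if run_len >= 8:
            flagged.add(i)
    return sorted(flagged)
```

With limits estimated from a fault-free stretch of a metric such as average
CPU load, `western_electric_flags` returns the positions of samples that
either jump outside the 3-sigma band or belong to a sustained one-sided run.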

It was shown that control charts are quite effective at detecting abnormal
behaviour of various supercomputer metrics.
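EWMA and CUSUM charts accumulate information over successive samples, which
makes them more sensitive than a plain Shewhart chart to small sustained
shifts. A minimal Python sketch follows, under assumed parameters: smoothing
weight 0.2 and limit width 3 for EWMA, reference value 0.5 sigma and decision
interval 5 sigma for CUSUM. These are common textbook defaults, not values
taken from the report, and the function names are hypothetical.

```python
import math


def ewma_alarms(series, mu0, sigma, lam=0.2, L=3.0):
    """Return indices where the EWMA statistic leaves its control limits.

    z_t = lam * x_t + (1 - lam) * z_{t-1}, with z_0 = mu0.
    The limits use the exact time-varying variance of z_t.
    """
    alarms = []
    z = mu0
    for t, x in enumerate(series, start=1):
        z = lam * x + (1 - lam) * z
        half_width = L * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))
        )
        if abs(z - mu0) > half_width:
            alarms.append(t - 1)  # report the 0-based index
    return alarms


def cusum_alarms(series, mu0, sigma, k=0.5, h=5.0):
    """Return indices where a two-sided tabular CUSUM exceeds h * sigma."""
    alarms = []
    c_plus = c_minus = 0.0
    for i, x in enumerate(series):
        # Accumulate deviations beyond the slack k * sigma in each direction.
        c_plus = max(0.0, x - (mu0 + k * sigma) + c_plus)
        c_minus = max(0.0, (mu0 - k * sigma) - x + c_minus)
        if c_plus > h * sigma or c_minus > h * sigma:
            alarms.append(i)
    return alarms
```

Both functions take the in-control mean `mu0` and standard deviation `sigma`
as inputs, so in practice these would first be estimated from a period known
to be free of malfunctions.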

Literature

1. Gerhard Munz and Georg Carle. Application of forecasting techniques and
control charts for traffic anomaly detection // Proceedings of the 19th ITC
Specialist Seminar on Network Usage and Traffic, Berlin, Germany, October 2008.

2. Celso Mendes and Daniel Reed. Monitoring large systems via statistical
sampling // International Journal of High Performance Computing Applications,
May 2004, 18, p. 267-277.

3. Douglas Montgomery. Introduction to statistical quality control // Wiley,
6th edition, 2008, 734 p.


