The Research Computing Center of Lomonosov Moscow State University is developing the Octotron software suite, which automatically monitors supercomputers and mitigates emergency situations in them in order to maximize hardware reliability. The suite is based on a software model of the supercomputer: a graph that describes the components of the computing system and their interconnections. One of the most complex supercomputer components that must be included in the model is the communication network. This work describes an approach for automatically discovering the topology of a supercomputer's Ethernet communication network and describing it in terms of the Octotron model. The approach automatically detects computing nodes and switches, collects information about them, and identifies their interconnections. Its application is demonstrated on the “Lomonosov” and “Lomonosov-2” supercomputers.
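
As a rough illustration of the graph description mentioned above, the following Python sketch builds a small topology graph from neighbor records. It is not the Octotron API or the authors' implementation: the record layout, the device and port names, and the `build_topology` helper are all assumptions made for this example, and the records are hard-coded where a real tool would query the switches.

```python
from collections import defaultdict

# Hypothetical neighbor records, as they might be gathered from switch
# neighbor tables: (local_device, local_port, remote_device, remote_port).
# All names here are made up for illustration.
NEIGHBORS = [
    ("switch-1", "ge-0/0/1", "node-001", "eth0"),
    ("switch-1", "ge-0/0/2", "node-002", "eth0"),
    ("switch-1", "ge-0/0/48", "switch-2", "ge-0/0/48"),
    ("switch-2", "ge-0/0/48", "switch-1", "ge-0/0/48"),  # same link, seen from the other end
    ("switch-2", "ge-0/0/1", "node-003", "eth0"),
]


def build_topology(records):
    """Return (vertices, links, adjacency) for an undirected topology graph."""
    vertices = set()
    links = set()
    for local_dev, local_port, remote_dev, remote_port in records:
        vertices.update((local_dev, remote_dev))
        # Sort the two endpoints so that a link reported by both ends
        # is stored only once.
        link = tuple(sorted([(local_dev, local_port), (remote_dev, remote_port)]))
        links.add(link)

    # Device-level adjacency derived from the deduplicated links.
    adjacency = defaultdict(set)
    for (dev_a, _), (dev_b, _) in links:
        adjacency[dev_a].add(dev_b)
        adjacency[dev_b].add(dev_a)
    return vertices, links, dict(adjacency)


vertices, links, adjacency = build_topology(NEIGHBORS)
```

Storing each link as a sorted pair of (device, port) endpoints is one simple way to keep the graph undirected when both ends of a cable report the same connection.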
