The Research Computing Center of Lomonosov Moscow State University is developing the Octotron software suite for automatic monitoring and mitigation of emergency situations in supercomputers, with the aim of maximizing hardware reliability. The suite is based on a software model of the supercomputer, which uses a graph to describe the components of the computing system and their interconnections. One of the most complex supercomputer components that must be included in the model is the communication network. This work describes the proposed approach for automatically discovering the topology of a supercomputer's Ethernet communication network and describing it in terms of the Octotron model. The suite automatically detects computing nodes and switches, collects information about them, and identifies their interconnections. The application of this approach is demonstrated on the “Lomonosov” and “Lomonosov-2” supercomputers.
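To illustrate what a graph description of discovered Ethernet links might look like, here is a minimal sketch. It assumes the raw input is a list of neighbor records of the form (local device, local port, remote device, remote port), such as a switch's LLDP neighbor table could provide. The record format, the `Topology` and `build_topology` names, and the heuristic that any device reporting neighbors is a switch (everything else being a computing node) are hypothetical choices for this example; this is not the Octotron implementation or its model format.

```python
from dataclasses import dataclass, field

# Hypothetical neighbor record: (local_device, local_port, remote_device, remote_port).
# Assumed input format for illustration only, not Octotron's.
Record = tuple[str, str, str, str]
Endpoint = tuple[str, str]  # (device name, port name)


@dataclass
class Topology:
    # Device name -> "switch" or "node".
    kinds: dict[str, str] = field(default_factory=dict)
    # Undirected links, each stored once as a sorted pair of endpoints.
    links: set[tuple[Endpoint, Endpoint]] = field(default_factory=set)

    def neighbors(self, device: str) -> set[str]:
        """Return the names of all devices directly linked to `device`."""
        result = set()
        for (a, _), (b, _) in self.links:
            if a == device:
                result.add(b)
            elif b == device:
                result.add(a)
        return result


def build_topology(records: list[Record]) -> Topology:
    topo = Topology()
    # Assumption: only switches are queried, so every device that
    # reports neighbors is a switch; all other devices are nodes.
    reporters = {local for local, _, _, _ in records}
    for local, lport, remote, rport in records:
        for dev in (local, remote):
            topo.kinds[dev] = "switch" if dev in reporters else "node"
        # Sort the endpoints so that reports seen from both ends
        # (A->B and B->A) collapse into a single undirected link.
        ends = sorted([(local, lport), (remote, rport)])
        topo.links.add((ends[0], ends[1]))
    return topo


if __name__ == "__main__":
    records = [
        ("sw1", "1", "node01", "eth0"),
        ("sw1", "2", "node02", "eth0"),
        ("sw1", "48", "sw2", "48"),
        ("sw2", "48", "sw1", "48"),  # same inter-switch link, seen from sw2
        ("sw2", "1", "node03", "eth0"),
    ]
    topo = build_topology(records)
    print(sorted(topo.neighbors("sw1")))  # ['node01', 'node02', 'sw2']
    print(topo.kinds["node03"])  # node
    print(len(topo.links))  # 4
```

Storing each link as a sorted endpoint pair keeps the graph undirected and deduplicates links reported by both adjacent switches; a real discovery tool would also need to attach the collected per-device information to each vertex.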
20 October 2016
NUMERICAL COMPUTATIONS: THEORY AND ALGORITHMS (NUMTA–2016): Proceedings of the 2nd International Conference “Numerical Computations: Theory and Algorithms”
19–25 June 2016
Pizzo Calabro, Italy
Research Article
Automatic discovery of the communication network topology for building a supercomputer model
Sergey Sobolev, Konstantin Stefanov, Vadim Voevodin
Research Computing Center of Lomonosov Moscow State University, 119991, Leninskie Gory, 1, bld. 4, Moscow, Russia
AIP Conf. Proc. 1776, 090014 (2016)
Citation
Sergey Sobolev, Konstantin Stefanov, Vadim Voevodin; Automatic discovery of the communication network topology for building a supercomputer model. AIP Conf. Proc. 20 October 2016; 1776 (1): 090014. https://doi.org/10.1063/1.4965378