The Research Computing Center of Lomonosov Moscow State University is developing the Octotron software suite, which automatically monitors supercomputers and mitigates emergency situations in order to maximize hardware reliability. The suite is based on a software model of the supercomputer, in which a graph describes the components of the computing system and their interconnections. One of the most complex supercomputer components that the model must include is the communication network. This work describes an approach for automatically discovering the topology of a supercomputer's Ethernet communication network and describing it in terms of the Octotron model. The suite automatically detects computing nodes and switches, collects information about them, and identifies their interconnections. The approach is demonstrated on the “Lomonosov” and “Lomonosov-2” supercomputers.
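The idea of a graph of components and interconnections can be illustrated with a minimal Python sketch. It is not the Octotron API and not necessarily the authors' discovery method: the function `build_topology`, the input layout, and the "one MAC address per port" heuristic are all assumptions made for illustration. The sketch assumes that each switch's forwarding table has already been collected (for example over SNMP) and treats a port that has learned exactly one known MAC address as a direct link to that computing node.

```python
def build_topology(fdb, node_macs):
    """Build a simple graph of switches and computing nodes.

    fdb       -- hypothetical input: {switch_name: {port: set_of_learned_macs}}
    node_macs -- hypothetical input: {mac_address: node_name}

    Returns (vertices, edges), where each edge is (switch, port, node).
    """
    vertices = set(fdb) | set(node_macs.values())
    edges = set()
    for switch, ports in fdb.items():
        for port, macs in ports.items():
            # Assumed heuristic: a port that learned exactly one MAC address
            # is an edge port with a single directly attached device.
            if len(macs) != 1:
                continue  # uplink or shared port, ignored in this sketch
            (mac,) = macs
            node = node_macs.get(mac)
            if node is not None:
                edges.add((switch, port, node))
    return vertices, edges


# Made-up example data: two switches, four computing nodes.
example_fdb = {
    "sw1": {1: {"aa:01"}, 2: {"aa:02"}, 24: {"aa:03", "aa:04"}},
    "sw2": {1: {"aa:03"}, 2: {"aa:04"}, 24: {"aa:01", "aa:02"}},
}
example_nodes = {
    "aa:01": "node1",
    "aa:02": "node2",
    "aa:03": "node3",
    "aa:04": "node4",
}

vertices, edges = build_topology(example_fdb, example_nodes)
```

Switch-to-switch links are deliberately left out here: uplink ports learn many addresses, so identifying them would need additional information, such as neighbor data reported by the switches themselves.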
Automatic discovery of the communication network topology for building a supercomputer model
Sergey Sobolev, Konstantin Stefanov, Vadim Voevodin; Automatic discovery of the communication network topology for building a supercomputer model. AIP Conf. Proc. 20 October 2016; 1776 (1): 090014. https://doi.org/10.1063/1.4965378