Sire is a Python/C++ library that is used both to prototype new algorithms and as an interoperability engine for exchanging information between molecular simulation programs. It provides a collection of file parsers and information converters that together make it easier to combine and leverage the functionality of many other programs and libraries. This empowers researchers to use sire to write a single script that can, for example, load a molecule from a PDBx/mmCIF file via Gemmi, perform SMARTS searches via RDKit, parameterize molecules using BioSimSpace, run GPU-accelerated molecular dynamics via OpenMM, and then display the resulting dynamics trajectory in an NGLView Jupyter notebook 3D molecular viewer. This functionality is built on by BioSimSpace, which uses sire’s molecular information engine to interconvert with programs such as GROMACS, NAMD, Amber, and AmberTools for automated molecular parameterization and the running of molecular dynamics, metadynamics, and alchemical free energy workflows. Sire comes complete with a powerful molecular information search engine, plus trajectory loading and editing, analysis, and energy evaluation engines. This, when combined with an in-built computer algebra system, gives substantial flexibility to researchers to load, search for, edit, and combine molecular information from multiple sources and use that to drive novel algorithms by combining functionality from other programs. Sire is open source (GPL3) and is available via conda and at a free Jupyter notebook server at https://try.openbiosim.org. Sire is supported by the not-for-profit OpenBioSim community interest company.
I. INTRODUCTION AND FEATURES
Sire is a molecular information and interoperability engine that has been under continuous active development since 2005. The software has two design goals:
To support rapid prototyping of new algorithms and ideas.
To act as an “interoperability glue” for moving molecular information between programs and thereby gluing them together in scripts that combine their functionality.
The software was designed to be a library that made it easier to write other programs, rather than a program in and of itself. As such, it was never meant to be directly visible or directly used by researchers. Instead, researchers use software that has been enabled by sire and that invisibly uses sire “under the hood.” This design philosophy inspired its name: sire is a short, single-syllable word that means “to parent,” i.e., sire would be the “parent” of other programs. Software developed using sire includes waterswap,1–3 quantomm,4 nautilus,5 somd,6 FESetup,7 and most notably BioSimSpace.8,9 Commercially, sire is embedded in Flare, produced by Cresset,10–12 and within the drug discovery pipelines used by Exscientia.ai (see https://github.com/Exscientia/biosimspace).
While sire started as a tool for prototyping, at present, it is mostly used as a molecular information and interoperability engine. Its design, discussed below, supports reading, editing, searching, and writing of molecular information from a variety of molecular input file formats and software. It has a highly extensible and flexible molecular information engine and comes with converters that interconvert the information in this engine with molecular objects from packages such as RDKit,13 OpenMM,14 Gemmi,15 and NGLView.16 This empowers researchers to follow the FAIR philosophy,17 providing tools to help make molecular data accessible and scripts processing it interoperable and reproducible. Researchers can use the sire application programming interface (API) to, for example, load molecules from PDBx/mmCIF files via Gemmi, perform SMARTS searches via RDKit, run GPU-accelerated molecular dynamics via OpenMM, and then view the resulting molecular trajectories in the NGLView Jupyter notebook 3D molecular viewer. This functionality is used by BioSimSpace,8 which is built directly on sire and uses sire’s molecular information engine to interconvert with programs such as GROMACS,18 NAMD,19 Amber,20 and AmberTools21 for automated molecular parameterization and running of molecular dynamics, metadynamics, and free energy workflows. Current developments are now integrating machine learning potentials via tools such as emle-engine22 and pytorch,23 thereby bringing the potential of AI-assisted simulation to application in industrial drug discovery pipelines.
At present, the landscape of molecular modeling software is very different to when development on sire started in 2005. Back then, contemporary programs were distributed as monolithic suites, e.g., GROMACS with associated tools,18 CHARMM,24 sander/pmemd with the Amber suite,20 NAMD with VMD,19 etc. With the exception of GROMACS, which was GPL,25 these suites had proprietary licenses that made open sharing of modifications difficult. It was challenging to develop and distribute new molecular simulation tools that involved functionality drawn from multiple suites. There was pressure in a research group to choose one suite of tools and then do all research and algorithm development within that suite. In contrast, a new way of developing computational chemistry software was being pioneered by OpenBabel26 and PyMol.27 OpenBabel provided an open source toolkit for loading and manipulating molecules, and PyMol provided a (then) open source toolkit for molecular visualization. Sire’s development was inspired by these very open, toolkit-style projects. Instead of a monolithic suite, sire would be an open source library of building blocks that could be combined together in Python with other libraries to empower researchers to create their own bespoke applications. Development targeted what was then an unoccupied niche, namely, a toolkit to enable rapid prototyping of Monte Carlo-based molecular simulation algorithms. As the 2000s progressed, many research groups independently also started the development of open source toolkits that targeted other niches. For example, RDKit, which started around 2008, provides an excellent collection of tools for cheminformatics. MDAnalysis28 (started around 2008) and MDTraj29 (started around 2012) both provide powerful building blocks for trajectory editing and analysis. OpenMM (started around 2010) is a toolkit for graphics processing unit (GPU)-accelerated molecular dynamics simulations. 
As sire developed, we aimed to be complementary to these tools, e.g., working with OpenMM instead of developing our own GPU-accelerated molecular dynamics functionality, and now creating interconverters to enable functionality from many toolkits to be easily combined together. At present, researchers have a rich ecosystem of open source molecular toolkits available and accessible within Python. These span from quantum chemistry frameworks such as Psi4,30 through the aforementioned cheminformatics (OpenBabel, RDKit), trajectory analysis (MDAnalysis, MDTraj), visualization (NGLView), and simulation (OpenMM) packages, to newer toolkits for crystallography (Gemmi), forcefield development (OpenFF Interchange31), and reading/writing molecular file formats and associated molecular editing (Chemfiles32 and Parmed33). Sire’s functionality complements and overlaps with the functionality of many of these packages. This paper will describe what sire is, what it can do, and how it has evolved into an interoperability engine that can be used to combine functionality from many packages within a single script.
II. THEORETICAL BASIS AND SOFTWARE DESIGN PHILOSOPHY
The foundation of sire is a powerful C++ Property system. In this system, nearly every object derives from Property, and most objects are structured as dictionaries of arbitrary Property objects. This system is exposed to Python via a set of automatically generated Py++34/boost::python wrappers.35 Researchers using sire write Python scripts that can arbitrarily create and combine collections of Property-derived objects.
A key set of these objects are those that comprise the molecular information engine in the sire.mol Python sub-module. The information that makes up a single molecule is held in a MoleculeData object (Fig. 1). This information is divided into two parts:
MoleculeInfo. This gives the names of the atoms that make up the molecule and (optionally) their arrangement into residues, chains, and segments.
Molecular properties. This is an arbitrary collection of Property objects, indexed by user–supplied keys, which can be attached to either the whole molecule or to subsets of the molecule (e.g., atoms, residues, bonds, angles, chains, etc.). Example properties include AtomCoords, stored at key “coordinates” to hold atomic coordinates; AtomElements, stored at key “element” to hold chemical elements; BondOrder, stored at key “order” for bond orders; and Connectivity, stored at key “connectivity” for molecular connectivity.
Classes used to hold and access data about a molecule. Data are held as an arbitrary collection of properties in a Properties object held by a single MoleculeData for the molecule. This also holds MoleculeInfo, which gives the names and numbers of atoms, residues, chains, and segments in the molecule and how they are arranged (e.g., which atoms are in which residues). These data are viewed by viewer objects that hold an implicitly shared pointer to the MoleculeData, e.g., a Molecule provides a view of the entire molecule. The Residue here views the sixth residue in the molecule, and the Atom here views the eighth atom in the molecule, with both views each holding a pointer to the same MoleculeData.
The MoleculeData class is not publicly accessible from the API. Instead, it is held via copy-on-write implicitly shared pointers36 by the viewer and editor classes. For example, the Atom and AtomEditor classes provide a view and an editor for information relating to a single atom in the molecule. Similarly, the Residue and ResEditor classes provide a view and editor for residue-level information. A higher level of classes that implement a “Selector” interface implement collections (containers) of views. These are named using Selector… for views within a molecule and SelectorM… for views across multiple molecules. For example, atoms within a molecule are held in SelectorAtom, bonds in a molecule are held in SelectorBond, residues across lots of molecules are held in SelectorMResidue, and selections of whole molecules are held in SelectorMol.
A design goal is that this hierarchy of classes, editors, and viewers should be mostly invisible to researchers. The sire API is designed to create objects of these classes automatically based on what the researcher needs to do. For instance, calling the .edit() function on a view of an Atom will automatically create and return the AtomEditor of that view. Searching for all the bonds in a molecule will automatically return a SelectorBond view of all of the bonds. Because all of these views and editors hold only implicitly shared copy-on-write pointers to the underlying MoleculeData object, creating and navigating views is extremely fast and memory-efficient.
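The view/editor pattern described above can be illustrated with a minimal pure-Python sketch. This is not sire's implementation: the classes below are hypothetical stand-ins that borrow the names MoleculeData, Atom, and AtomEditor only to show how several views can share one data block while an editor copies that block on its first write.

```python
import copy


class MoleculeData:
    """Hypothetical stand-in for the shared data block (names + properties)."""

    def __init__(self, atom_names, properties=None):
        self.atom_names = list(atom_names)
        self.properties = dict(properties or {})


class AtomView:
    """Read-only view of one atom; holds a reference to shared data, not a copy."""

    def __init__(self, data, index):
        self._data = data
        self._index = index

    def name(self):
        return self._data.atom_names[self._index]

    def property(self, key):
        return self._data.properties[key][self._index]

    def edit(self):
        # Creating an editor is cheap: nothing is copied until the first write.
        return AtomEditor(self._data, self._index)


class AtomEditor:
    """Editor that detaches from the shared data the first time it writes."""

    def __init__(self, data, index):
        self._data = data
        self._index = index
        self._owns_copy = False

    def set_property(self, key, value):
        if not self._owns_copy:
            # Copy-on-write: existing views keep seeing the original data.
            self._data = copy.deepcopy(self._data)
            self._owns_copy = True
        self._data.properties[key][self._index] = value
        return self

    def commit(self):
        # The committed view now shares this data, so a later write must copy again.
        self._owns_copy = False
        return AtomView(self._data, self._index)


data = MoleculeData(
    ["N", "CA", "C"],
    {"element": ["N", "C", "C"], "charge": [0.0, 0.0, 0.0]},
)
atom = AtomView(data, 1)
other = AtomView(data, 2)          # second view, same underlying data
edited = atom.edit().set_property("charge", 0.1).commit()
```

Here `atom` and `other` share one data object, while `edited` refers to a detached copy, so the original views are unaffected by the edit.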
This data structure architecture is primarily exposed to the researcher via a powerful molecular search engine (Fig. 2). This is built using a custom grammar via boost::spirit37 and is complemented by SMILES38 and SMARTS searching enabled by RDKit.13 The search engine empowers researchers to, for example, find all bonds between carbon and nitrogen atoms in a collection of molecules (mols["bonds from element C to element N"]), find all molecules that have been given an is_perturbable property, which marks them as able to be perturbed during an alchemical free energy calculation (mols["molecules with property is_perturbable"]), find all residues called “ASP” that contain exactly one atom called “CA” (mols["resname ASP and residues with count(atomname CA) == 1"]), or find all molecules that match the SMILES string for alanine dipeptide (mols["smiles CNC(=O)C(C)NC(C)=O"]). The complete grammar is extensively documented.39
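As a small sketch of how these searches might be batched in a script, the helper below indexes any container with the search strings quoted above. The function name `run_searches` and the dictionary labels are illustrative and not part of sire; the handling of `KeyError` assumes that a search with no matches raises that exception.

```python
# Search strings quoted in the text, keyed by a short (hypothetical) label.
SEARCHES = {
    "c_n_bonds": "bonds from element C to element N",
    "perturbable": "molecules with property is_perturbable",
    "asp_one_ca": "resname ASP and residues with count(atomname CA) == 1",
    "ala_dipeptide": "smiles CNC(=O)C(C)NC(C)=O",
}


def run_searches(mols, searches=SEARCHES):
    """Index `mols` with each search string and collect the results by label."""
    results = {}
    for label, query in searches.items():
        try:
            results[label] = mols[query]
        except KeyError:
            # Assumption: a search that matches nothing raises KeyError.
            results[label] = None
    return results
```

With sire installed, `mols` would be the object returned by `sire.load`; any object that supports indexing by a search string can be passed in the same way.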
A Jupyter notebook in which sire has been used to load PDB structure 3NSS. Different search strings are used to highlight different parts of the loaded molecules.
The sire.io submodule includes molecular file parsers for several molecular parameter and coordinate file formats. These parsers are symmetrical, meaning that they are designed to read and write the same amount of information (with the caveat that ambiguity in the way some file formats can present information means that the written output may not be identical to the read input). Molecular information can be read into a MoleculeData object from, e.g., a combination of an Amber topology file and PDB coordinate file, and then edited using the sire API before being written back to a Gromacs topology and DCD coordinate file. The design goal of invisibility also applies to these parsers. The sire.load and sire.save functions automatically choose the right parser based on the contents of the file being loaded or the requested format for the save. The sire.load function will look at the file extension to make an initial guess of a parser. If this fails, it will, in parallel, try to load the file using all of the parsers in sire.io. Multiple files can be loaded at once, meaning that researchers do not need to think about the specifics of individual file formats. For example, the command sire.load("trajectory.something", "topology.something") will work, as long as one of these files contains a molecular topology and the other contains coordinates (or a trajectory). The sire.io framework will do the job of discovering the formats of these files and which one contains topology data and which one contains coordinate/trajectory data.
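The parser selection strategy described above (guess from the extension, then fall back to trying every parser in parallel) can be sketched in plain Python. The two toy parsers, the `PARSERS` registry, and `load_text` are hypothetical and exist only to illustrate the dispatch logic; they do not reproduce the sire.io parsers.

```python
from concurrent.futures import ThreadPoolExecutor


class ParseError(Exception):
    """Raised by a parser that cannot interpret the supplied text."""


def parse_pdb(text):
    # Toy parser: accept only text that contains ATOM records.
    atoms = [line for line in text.splitlines() if line.startswith("ATOM")]
    if not atoms:
        raise ParseError("no ATOM records found")
    return {"format": "pdb", "natoms": len(atoms)}


def parse_xyz(text):
    # Toy parser: the first line must be the number of atoms.
    lines = text.strip().splitlines()
    try:
        natoms = int(lines[0])
    except (IndexError, ValueError):
        raise ParseError("first line is not an atom count")
    return {"format": "xyz", "natoms": natoms}


PARSERS = {"pdb": parse_pdb, "xyz": parse_xyz}


def _attempt(parser, text):
    """Run one parser, returning None instead of raising on failure."""
    try:
        return parser(text)
    except ParseError:
        return None


def load_text(filename, text):
    """Pick a parser from the extension, else try all parsers in parallel."""
    extension = filename.rsplit(".", 1)[-1].lower()
    guess = PARSERS.get(extension)
    if guess is not None:
        result = _attempt(guess, text)
        if result is not None:
            return result

    # Fallback: the extension was unknown or misleading, so try every parser.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: _attempt(p, text), PARSERS.values()))

    for result in results:
        if result is not None:
            return result
    raise ParseError(f"no parser could read {filename!r}")


XYZ_TEXT = "2\ncomment\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
PDB_TEXT = "ATOM      1  CA  ALA A   1       0.0   0.0   0.0\n"

unknown_ext = load_text("molecule.something", XYZ_TEXT)
mislabelled = load_text("molecule.pdb", XYZ_TEXT)
```

Both calls succeed because the content, not the file name, ultimately decides which parser is used.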
Trajectories are handled natively, either as collections of single-coordinate frame files (such as “*.pdb”) or dedicated trajectory files (such as “coords.dcd”). Multiple trajectory files of different formats can be concatenated just by adding them to the sire.load call, e.g., sire.load("topology.top", "frame0.pdb", "frame1.pdb", "frames.dcd", "frames.xtc"). This, combined with the molecular search engine, makes it very easy to use the sire API to process the trajectory of a system containing many molecules and to extract specific frames or specific subsets of atoms while simultaneously applying molecular smoothing or alignment. Trajectory frames are loaded lazily, meaning that frames are only read into memory as needed. The trajectory loaders are also parallelized, both internally, meaning that multiple processor cores can work together to load data from a single frame into all the molecules, and externally, meaning that if multiple frames are requested, they can be read in parallel. This enables rapid processing of huge trajectories that are larger than available memory, with multiple frames analyzed and lazily loaded in parallel using all of the available processor cores. Quantities such as root mean square deviations (RMSDs), distances, angles, etc. can be readily calculated across frames using built-in analysis tools. These tools are mostly accessed via member functions of the views, e.g., mols.trajectory().rmsd() would calculate the RMSD across all frames of the trajectory associated with the molecule views in mols. Similarly, angles.trajectory().measures() would calculate angle sizes across all frames of the trajectory associated with the angle views in angles.
This approach extends to supporting Python lambda functions, e.g., mols["element C"].trajectory().apply(lambda atoms: sire.measure(atoms[0], atoms[1])) would apply the passed lambda function to every frame of the trajectory of the carbon atoms (here, calculating the distance between the first two carbon atoms).
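The combination of lazy frame loading, parallel processing, and user-supplied functions can be sketched without sire as follows. `LazyTrajectory` is a hypothetical class written for this illustration; its `apply` and `rmsd` methods only mimic the shape of the calls quoted above. The RMSD here is computed without any alignment, and Python threads stand in for the multi-core loaders.

```python
import math
from concurrent.futures import ThreadPoolExecutor


class LazyTrajectory:
    """Hypothetical trajectory whose frames are read only when requested."""

    def __init__(self, num_frames, read_frame):
        self._num_frames = num_frames
        self._read_frame = read_frame  # callable: frame index -> list of (x, y, z)

    def __len__(self):
        return self._num_frames

    def frame(self, index):
        # Nothing is cached, so memory use is bounded by the frames in flight.
        return self._read_frame(index)

    def apply(self, func, max_workers=4):
        """Return [func(frame) for every frame], reading frames concurrently."""

        def work(index):
            return func(self.frame(index))

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(work, range(self._num_frames)))

    def rmsd(self, reference_index=0):
        """RMSD of every frame against a reference frame (no alignment)."""
        reference = self.frame(reference_index)

        def deviation(coords):
            total = sum(math.dist(a, b) ** 2 for a, b in zip(coords, reference))
            return math.sqrt(total / len(reference))

        return self.apply(deviation)


def synthetic_frame(index):
    # Two atoms: one fixed at the origin, one moving along x with the frame index.
    return [(0.0, 0.0, 0.0), (1.0 + index, 0.0, 0.0)]


traj = LazyTrajectory(3, synthetic_frame)
distances = traj.apply(lambda coords: math.dist(coords[0], coords[1]))
rmsds = traj.rmsd()
```

The lambda passed to `apply` plays the same role as the one in the text: it receives the data for one frame and returns a single measurement.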
Sire also has a complete energy engine, meaning that energies can be calculated for entire systems or views within systems across all frames or subsets of frames of a trajectory. This is used in BioSimSpace to calculate different types of protein–ligand restraints used in alchemical absolute binding free energy calculations.9,40 These engines are powered by a complete internal Computer Algebra System (CAS) in sire.cas. Data, such as bond and angle potentials, are stored as algebraic expressions. These are converted to and from parameter values (e.g., k and r0 for bonds) as and when needed. This gives substantial flexibility to all aspects of the program; for example, sire.cas powers the user-supplied equations that control how forcefield parameters are perturbed across a λ-coordinate during a GPU-accelerated OpenMM molecular dynamics free energy simulation.
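To make the idea of equation-controlled perturbation concrete, the sketch below interpolates harmonic bond parameters along λ using a user-supplied equation. It does not use sire.cas: plain Python callables stand in for algebraic expressions, the harmonic form k(r − r0)² is an assumption made for the example, and the function names are hypothetical.

```python
def harmonic_bond(k, r0):
    """Return E(r) = k * (r - r0)**2 as a callable (assumed functional form)."""
    return lambda r: k * (r - r0) ** 2


def linear_morph(initial, final, lam):
    """Linear interpolation between end-state values: (1 - lam)*initial + lam*final."""
    return (1.0 - lam) * initial + lam * final


def perturbed_bond(initial, final, lam, equation=linear_morph):
    """Build the bond potential at `lam` from end-state parameters.

    `initial` and `final` are dicts with keys "k" and "r0"; `equation` is the
    user-supplied rule for how each parameter changes with lam.
    """
    k = equation(initial["k"], final["k"], lam)
    r0 = equation(initial["r0"], final["r0"], lam)
    return harmonic_bond(k, r0)


state_a = {"k": 100.0, "r0": 1.0}
state_b = {"k": 200.0, "r0": 1.2}

# Default linear schedule, evaluated halfway between the end states.
halfway = perturbed_bond(state_a, state_b, 0.5)

# A custom rule that switches the parameters on quadratically in lam.
quadratic = lambda initial, final, lam: initial + (final - initial) * lam ** 2
custom = perturbed_bond(state_a, state_b, 0.5, equation=quadratic)
```

Swapping the `equation` argument changes how every parameter responds to λ without touching the code that builds the potential, which is the flexibility the algebraic approach is meant to provide.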
The sire.convert submodule includes functionality to interconvert the molecular information in MoleculeData with the corresponding molecule or system objects in the libraries RDKit,13 OpenMM,14 Gemmi,15 and BioSimSpace.8 This can be used directly, e.g., creating RDKit molecules via the RDKit SMILES functions, then converting to BioSimSpace for forcefield parameterization, then converting to OpenMM for GPU-accelerated molecular dynamics simulation (Fig. 3). Or, alternatively, it is used in the background by higher-level sire functions that leverage the capabilities of those packages; e.g., using sire.load to load a PDBx/mmCIF file will call Gemmi to load the file into a gemmi::Structure and will then convert that into a sire.system.System comprising one MoleculeData per molecule. Similarly, passing a SMILES or SMARTS string to the sire search engine will trigger an automatic conversion to RDKit to perform the search. Calling the .dynamics() function on any molecules will trigger an automatic conversion to OpenMM to perform GPU-accelerated molecular dynamics simulations.
A Jupyter notebook in which RDKit is used to create a molecule from a SMILES string. Sire converts this to BioSimSpace, which is used to parameterize and solvate the molecule. Sire converts the result to OpenMM, which is used to perform GPU-accelerated molecular dynamics.
III. SCIENTIFIC APPLICATIONS DEVELOPED WITH SIRE
Sire has been used over the years as a library to prototype a range of biomolecular simulation methodologies.
The waterswap application was developed to compute the absolute binding free energies of small molecules to proteins using a Gibbs-Ensemble methodology with Monte Carlo sampling.1–3 This approach was later extended to produce the applications ligandswap and proteinswap, providing relative free energies of binding through a dual topology approach.
The quantomm application was developed to enable the estimation of free energy differences at a quantum mechanics/molecular mechanics (QM/MM) level of theory by combining molecular mechanics (MM) alchemical free energy calculations with MM to QM/MM end-state corrections.4 It was implemented by using the IO and Monte Carlo functionality of sire to interface a multiple-timestep Monte Carlo sampler that was driven by various QM engines.
The nautilus application was developed to implement the Grid Cell Theory methodology for the spatial resolution of free energies, entropies, and enthalpies of water molecules in the vicinity of an organic or a biological molecule.5,41–44 The implementation combined the IO and energy/trajectory analysis functionality of sire with post-process molecular dynamics simulation trajectories generated by different MD engines.
The FESetup application was developed to automate the preparation of input files for alchemical free energy calculations with a variety of biomolecular simulation packages.7 FESetup reused sire functionality to parse topologies and manipulate representations of molecules to generate input files suitable for single-topology alchemical free energy calculations.
The somd application was developed by interfacing sire’s IO, energy, and computer algebra functionality with the GPU-accelerated MD functionality of the OpenMM library to create a single topology alchemical molecular dynamics free energy calculation engine. somd has extensively been used over the years to develop alchemical free energy calculation methodologies and to support drug discovery projects.6,12,40,44–58
The BioSimSpace library was developed on top of sire to take the concept of molecular interoperability engines further and capture programmatically entire “best-practice” protocols for a range of popular biomolecular simulation tasks in a single Python package with standardized interfaces.8 BioSimSpace makes extensive use of sire’s IO and molecule representation functionality to “glue” components from the biomolecular simulation ecosystem. Initially developed as an academic community-driven effort by CCPBioSim, BioSimSpace was later funded by industrial partners who saw value in the framework to optimize biomolecular software R&D processes for drug discovery. A suite of tutorials demonstrating the range of scientific functionality supported by BioSimSpace is available elsewhere.9
IV. LICENSING
Sire is a C++/Python program that is distributed under the GPL license25 (initially GPL2, and GPL3 since 2022). This license was chosen to remove any complexity related to intellectual property claims made by the universities (predominantly in the UK) that employed the Ph.D. students and postdocs (PDRAs) who contributed to sire's later development. Sire started as a rewrite of the ProtoMS program.60,61 ProtoMS was written during the doctoral studies of Woods during an industrially funded Ph.D. in 2003. At that time, little academic software was released under an open source license, and it was challenging to convince the University of the merits of releasing ProtoMS openly. Fortunately, the industrial partner on that project was supportive, and so an agreement was made that ProtoMS could be released under the GPLv2 as long as a contract was signed that forbade Woods, the University, or that industrial partner from ever trying to separately commercialize the software. This was acceptable to all, and ProtoMS was released openly.
Woods subsequently developed sire as a rewrite of ProtoMS during a year-long self-funded project between the end of his Ph.D. studies and a PDRA position. Woods released sire under the GPLv2 before taking up employment with a University, which had the effect of preempting many questions related to ownership of intellectual property (IP) by employers. Sire was subsequently developed as a tool to further Woods’ PDRA research objectives. Each grant that funded that research included a clause that all software developed during those projects would be released openly under the GPL. These clauses were added to all other grants for collaborators and industrial partners as use and development of sire grew. The choice of the GPL made it very easy to negotiate IP, as it was very clear to all parties that all code would be released openly, and there was no risk of any one group developing an extension of sire that would be kept private and then separately commercialized under a different license (a perennial possibility for non-copyleft open source licenses such as MIT or BSD).
The GPL license is well aligned with the community-driven development model of sire. Different business models are available to organizations using sire with proprietary software. In one example, a commercial software vendor ships two Python interpreters to power the back end of a commercial user interface. One interpreter is reserved as an application launcher for GPL-licensed code, while the other interpreter runs the non-GPL code, meaning that the two categories of code communicate “at arm's length.” In another example, a biotech company has adopted a vertical integration strategy and developed computational chemistry software on top of sire to support internal drug discovery projects. The software is not distributed in a legal sense, and this mode of usage is fully compliant with the GPL license. Where appropriate, code modifications and enhancements are contributed back to the community under the GPL. Cloud-computing businesses that provide software-as-a-service offerings through the Internet may provide network access to software derived from sire hosted on a server without having to distribute the source code.
One ongoing challenge of the choice of GPL is that it creates an imbalance within the academic community. While sire (and its derivatives) are free to build on academic code released under non-copyleft BSD/MIT-style licenses, those communities are not able to pull code from sire into their codebases without re-licensing to GPL. A part-solution to this is that the sire developers are good community citizens and are happy to discuss the extraction and re-licensing of subsets of the Python and C++ code from sire under non-copyleft open source licenses where it is clear that we own the copyright and that this is to support other open community-driven molecular toolkits. To give a specific (hypothetical) example, the free energy perturbation support code in sire utilizes concepts of a LambdaLever and a LambdaSchedule. These classes can be used to algebraically control the perturbation of forcefield components in very bespoke and custom ways within OpenMM contexts used for free energy simulations. These classes (and ideas) could be potentially useful outside of sire. Rather than require others to duplicate or re-implement this code, we would be open to discussions with groups who would like to extract these classes and to change the license on these extracts to, e.g., MIT or BSD so that they can be compatible with other open source toolkits.
V. MAINTENANCE STRATEGY
Development and maintenance of sire were originally ad hoc, conducted by Ph.D. students, PDRAs, and academic staff as a secondary concern behind the daily demands of their research projects. This became increasingly untenable as the commercial use of sire grew. Initially, maintenance moved to be funded as a side-activity of larger academic/industrial grants (e.g., EP/P022138/1, which funded the development of BioSimSpace) or academic/industrial EPSRC Impact Acceleration Awards with industrial users. Eventually, these became untenable for a variety of reasons. A major factor was that the provision of maintenance to existing research software is not a strong focus of research-intensive universities. This realization led to the creation of the not-for-profit OpenBioSim Community Interest Company (https://openbiosim.org), which enables the provision of professional scientific software engineering services to a variety of stakeholders in academia and industry. The remit of OpenBioSim is the development and maintenance of open-source biomolecular simulation software. It currently supports both sire and BioSimSpace.
Development is now formalized, with clean development and maintenance branches and a regular release cadence. The development team has grown, with 15 contributors, including three core developers with direct commit access to the repository. As a small team, governance is light-weight, with changes discussed between core developers in response to feature requests and OpenBioSim roadmaps. These roadmaps are created via discussions between OpenBioSim members and its industrial partners. Contributions are welcome from the community, with all pull requests from forks reviewed by a core developer for acceptance into the project. Development is conducted via feature branches from devel. These are merged following the passing of unit tests run via GitHub Actions and a code review from a core developer. Releases are made at the end of each quarter from the then current, working version of the code in devel. This is merged into the main branch, with longer regression tests run to further assure code quality and manage backward compatibility. A CalVer62 versioning scheme is used, e.g., 2023.1.0 was the first release in 2023. Bugs and issues identified between releases are fixed via issues branches from devel, which are backported to main on acceptance. Patch releases from main are made within the quarter, e.g., 2023.3.1, depending on the number or urgency of the fixes. Separately, commercial partners operate their own release schedule based on forks and internal builds of the software from main. This process was developed in discussion with those partners as the best balance between the continual academic development of software and the timely provision of stable, predictable releases that can be supported by OpenBioSim’s developer resources.
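The versioning scheme can be handled programmatically with a few lines of Python. The sketch below assumes that version strings always have three integer fields, read here as year, release number within that year, and patch number; this interpretation is inferred from the two examples above (2023.1.0 and 2023.3.1), and the `CalVer` class is hypothetical rather than part of sire.

```python
from typing import NamedTuple


class CalVer(NamedTuple):
    """Hypothetical three-field calendar version: YEAR.RELEASE.PATCH."""

    year: int
    release: int
    patch: int

    @classmethod
    def parse(cls, text):
        # Assumes exactly three dot-separated integer fields.
        year, release, patch = (int(part) for part in text.split("."))
        return cls(year, release, patch)

    def is_patch_release(self):
        # A non-zero final field marks a fix made between scheduled releases.
        return self.patch > 0

    def __str__(self):
        return f"{self.year}.{self.release}.{self.patch}"


versions = [CalVer.parse(v) for v in ("2023.3.1", "2023.1.0", "2023.3.0")]
ordered = [str(v) for v in sorted(versions)]
latest = max(versions)
```

Because the class is a tuple, versions compare field by field, so sorting and `max` give chronological order without any extra code.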
VI. TUTORIAL-DRIVEN DEVELOPMENT
The age and gradual evolution of sire mean that it is a large program (over 400k lines of code), and the sire API contains a lot of functionality. During the 2010s, it was found that it was difficult for researchers to know what functionality was available in the program or even how to use it. In 2022, as part of the move to OpenBioSim, it was decided to give sire a complete refurbishment and refreshment. The original C++-inspired Python API was moved to the sire.legacy submodule, and a new fully Pythonic API was created. This work was undertaken using a tutorial-driven development process.63 Tutorial-driven development involves writing the tutorial for how to use a program first and then using that tutorial as the guide to write the code that makes the tutorial work. The rationale for this process was twofold:
The tutorial is the common meeting point for all stakeholders in a program. Writing the tutorial first lets everyone play a part in saying how a new feature should develop and whether the API for that feature makes sense.
Time spent writing code that is not described in a tutorial is wasted because researchers will not know that this code exists or how to use it.
The tutorial-driven design process progressed through 2022, resulting in the 2023.1.0 release of sire with an accompanying complete set of tutorials at https://sire.openbiosim.org. Development has continued using this process through 2023, with quarterly releases and corresponding expansions of the new Pythonic API and complementing tutorials, making sire a much easier-to-use and more community-friendly program.
VII. DISTRIBUTION MECHANISM
It is challenging to package sire because we endeavor to maintain compatibility with a number of other Python-based toolkits. We do this because derived software, such as BioSimSpace, functions by installing and importing sire into Python environments in which a large number of other toolkits are also present. Being written in C++ presents challenges with respect to maintaining binary compatibility with other libraries, especially across multiple versions of Python and multiple operating systems. A design goal is that sire is easily installable, without compilation, on Windows, Linux, and MacOS. In the past, this forced us to maintain our own binary Python distributions, which were shipped as single “one-click” installer files. Fortunately, in 2012, Anaconda Python was released, and by 2016, conda-forge64 was in widespread use. Through 2017–2020, as part of the BioSimSpace project, we undertook the work needed to create conda-forge compatible sire packages. To ensure compatibility with BioSimSpace’s dependencies, at present, for each release, conda packages are built against the then-latest versions of all of sire’s and BioSimSpace’s dependencies from conda-forge. These are built for three versions of Python (currently 3.10, 3.11, and 3.12) for Windows, MacOS, and Linux. Packages are built using GitHub Actions, with compilation taking between 90 and 120 min per package. The build is too long and complex to be supported by conda-forge, and therefore, sire packages are built by us and uploaded to the OpenBioSim conda channel on anaconda.org.65 As individual packages are large and space is limited on this channel, only the latest two major versions of the packages are available from here. Older packages are archived on a separate OpenBioSim archive channel, which is hosted on Azure.66 The specific versions of dependencies used by the conda recipe are easy to edit.
We support industrial partners in creating custom conda recipes and packages that are compatible with the specific details and versions of dependencies of their Python environments.
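To give a sense of the scale of the per-release build described above, the following minimal Python sketch enumerates the build matrix and estimates the total compilation time. The Python versions, operating systems, and the 90 to 120 min figure are taken from the text; the assumption that exactly one package is built per Python version and operating system pair, and the function names `build_matrix` and `total_build_hours`, are illustrative only and do not describe the actual GitHub Actions workflow.

```python
from itertools import product

# Values quoted in the text. One package per (Python, OS) pair is an
# assumption made for this sketch, not a statement about the real workflow.
PYTHON_VERSIONS = ("3.10", "3.11", "3.12")
PLATFORMS = ("Windows", "macOS", "Linux")
MINUTES_PER_PACKAGE = (90, 120)  # quoted lower and upper compilation times


def build_matrix(pythons=PYTHON_VERSIONS, platforms=PLATFORMS):
    """Return every (python_version, platform) pair built for one release."""
    return list(product(pythons, platforms))


def total_build_hours(n_packages, minutes_range=MINUTES_PER_PACKAGE):
    """Return the (low, high) total compilation time in hours."""
    low, high = minutes_range
    return n_packages * low / 60, n_packages * high / 60


if __name__ == "__main__":
    matrix = build_matrix()
    low, high = total_build_hours(len(matrix))
    print(f"{len(matrix)} packages, {low:.1f} to {high:.1f} compute hours per release")
```

Under these assumptions, each release requires nine packages and roughly 13.5 to 18 h of cumulative compilation, which illustrates why the build exceeds what conda-forge can support.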
We were early adopters of Jupyter notebooks, seeing the value in using them for molecular simulation and making them a key user interface for BioSimSpace as it was developed in 2017.8 We now run an open-access JupyterHub server with the latest major version of sire installed at https://try.openbiosim.org. This lets researchers try sire in their browsers without having to install anything.
Finally, sire is and always will be open source software. All 7000+ versions of the source code since its inception in 2005 are online, with releases documented in the changelog.67 As described in and linked to from this changelog, development started on a local Subversion repository before moving to a Google Code Subversion space in 2006, then the michellab GitHub Organization in 2015, and then the OpenBioSim GitHub Organization in 2023.
VIII. CONCLUSION
We could not have foreseen, in 2005, that nearly 20 years later, sire would still be under development and in use by industrial and academic groups internationally. Creating software that is deliberately meant to be hidden and not visible to end-user researchers has many challenges. The software has been executed millions of times,68 but often with most researchers being oblivious to that use. As a result, arguing for and financing its continued maintenance and development is difficult, as it is not immediately obvious why sire is important or what it does. Conversely, a delicate balance must always be struck between making sire more visible and more capable as a foundation for other programs and not going so far as to replace or compete with the other packages in the rich biomolecular simulation software ecosystem on which we depend. However, as academic software development moves away from self-contained codes and developers move more toward a “divide and conquer” approach of providing small, well-defined software solutions, we have learned exactly what sire is and how we achieve that balance. Sire is a molecular information and interoperability engine, which makes it easier to combine and leverage the functionality of many other programs and libraries. It is the interoperability layer that provides a “town square” where other programs can meet and exchange molecular information and functionality. Keeping that in mind and being good citizens in the wider community of open source biomolecular simulation software developers should help ensure that the use of sire grows and that it continues to be actively developed, maintained, and useful for academic and industrial researchers for many years to come.
ACKNOWLEDGMENTS
We acknowledge the funding that supported the development of sire over nearly 20 years. Funding included research grants from EPSRC that required prototyping functionality, which indirectly funded the early development of sire: Grant Nos. EP/E022197/1, EP/F010516/1, and EP/G042853/1; research grants from EPSRC and BBSRC to develop software built on top of sire: Grant Nos. EP/I030395, BB/K016601, EP/P011330/1, EP/P022138/1, and EP/V011421/1; industrial–academic partnerships funded via EPSRC Impact Acceleration Accounts (University of Bristol 2016 and University of Edinburgh, IAAPIII074) and InnovateUK funding (KTP Partnership No. 011120); industrial partners and other supporting organizations on many of these awards, from Cresset, Drug Design Data Resource, Evotec, Exscientia, Memorial Sloan-Kettering Cancer Center, Molecular Science Software Institute, Software Sustainability Institute, Syngenta, and UCB; EPSRC research fellowships awarded to C.J.W. (Grant No. EP/N018591/1) and A.M. (Grant No. EP/G007705/1); and the Royal Society University Research Fellowship and ERC Starting Grant (No. 336289) awarded to J.M. We also acknowledge the support of the wider simulation community, in particular CCPBioSim (Grant Nos. EP/J010588/1, EP/T026308/1, and EP/M022609/1), for helping organize and host training workshops for sire-derived codes and supporting funding applications for its continued development. We acknowledge the support from HECBioSim (Grant Nos. EP/R029407/1, EP/R029407/2, and EP/X035603/1) for the provision of high-performance computing resources to validate scientific methodologies implemented by sire-derived codes. H.H.L. has contributed to sire as an employee of Cresset and the STFC, in the latter role funded by CCPBioSim (Grant No. EP/T026308/1).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts to disclose. J.M. is a member of the Scientific Advisory Board of Cresset.
Author Contributions
Christopher J. Woods: Conceptualization (equal); Formal analysis (equal); Funding acquisition (equal); Investigation (equal); Methodology (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Writing – original draft (equal); Writing – review & editing (equal). Lester Hedges: Software (equal). Adrian Mulholland: Funding acquisition (equal); Supervision (equal). Maturos Malaisree: Investigation (equal). Paolo Tosco: Software (equal). Hannes H. Loeffler: Methodology (equal); Software (equal). Miroslav Suruzhon: Methodology (equal); Software (equal). Matthew Burman: Software (equal). Sofia Bariami: Software (equal). Stefano Bosisio: Methodology (equal); Software (equal). Gaetano Calabro: Methodology (equal); Software (equal). Finlay Clark: Methodology (equal); Software (equal). Antonia S. J. S. Mey: Methodology (equal); Software (equal). Julien Michel: Conceptualization (equal); Funding acquisition (equal); Project administration (equal); Resources (equal); Software (equal); Supervision (equal); Writing – review & editing (equal).
DATA AVAILABILITY
Data sharing is not applicable to this article as no new data were created or analyzed in this study. The software is available at https://github.com/openbiosim/sire.