Digital materials data, high-throughput experiments and high-throughput computations are regarded as the three key pillars of materials genome initiatives. With the rapid growth of materials data, the integration and sharing of data have become urgent and have gradually become a hot topic in materials informatics. Because conventional heterogeneous database integration approaches, such as federated databases or data warehouses, lack semantic descriptions, it is difficult for them to integrate data deeply at the semantic level. In this paper, a semantic integration method is proposed that creates a semantic ontology by extracting the database schema semi-automatically. Other heterogeneous databases are then integrated into the ontology by means of relational algebra and rooted graphs. Based on the integrated ontology, semantic queries can be performed using SPARQL. In the experiments, two well-known first-principles computational databases, OQMD and Materials Project, are used as the integration targets, which demonstrates the availability and effectiveness of our method.
I. INTRODUCTION
Traditionally, materials science has relied heavily on costly experiments and simulation-based methods to understand the intrinsic mechanisms of the processing-structure-property-performance (PSPP) relationships. Currently, the big data generated by high-throughput experiments and computations provide great opportunities for data-driven techniques, which are one of the three pillars of the materials genome initiative (MGI). Data-driven techniques are playing a big role in revealing PSPP relationships in materials science; they can be used not only for forward models based on property prediction but also for inverse models based on materials discovery. Data-driven materials science is therefore regarded as the essential content of materials informatics, which provides the foundation for the fourth paradigm of materials discovery.1,2
As data generation becomes easier than before, the need for data sharing and data integration is gradually becoming urgent.3 However, this problem faces many challenges. Firstly, the data types vary widely, from discrete to continuous and from simple text to complex images, videos, etc. Secondly, the data schemas or formats differ across data sources, which makes the sources difficult to reconcile with each other. Thirdly, the data quality also differs across sources, which makes data evaluation and selection difficult. Finally, even when the data have been found, recognizing what the rows and columns represent can be another challenge: many datasets have machine-readable descriptions, but these are often in very large data dictionary files full of terminology designed primarily for experts in a given field.
Ontology, which is used to capture knowledge about some domain of interest, is widely used in knowledge engineering, information retrieval, information integration, etc. An ontology usually describes not only the concepts in a domain but also the relationships that hold among those concepts. This paper presents a methodology that integrates heterogeneous relational databases by transforming one database into an ontology and mapping the others into it, and then performs semantic queries on the integrated ontology-based system.
The rest of the paper is organized as follows. Sections II and III discuss the related work and the framework of the whole system. The mapping from the relational database to the ontology and the integration of other heterogeneous databases are described in Sections IV and V. The experimental integration of two well-known materials databases is shown in Section VI. Section VII provides some final conclusions and directions for future work.
II. RELATED WORK
The integration of distributed heterogeneous databases, sometimes called data integration, is an active area of research. Heterogeneous database integration can be divided into two categories according to the query method. One is the data warehouse, in which all of the data are integrated and stored in one data source. The other is on-demand retrieval, in which the query execution engine extracts data from the different data sources only when end users send a query to the system.
Query expansion is an important issue in the field of information retrieval. Chokri et al. put forward ontology-based query expansion,4 which can expand a single SQL (structured query language) query into several queries. It utilizes the synonyms and parent concepts in the ontology to perform the expansion, which is suitable only when an attribute in the database equals a concept in the ontology.
Bonatti et al. put forward the ontology extended relation (OER), which contains an ordinary relation as well as an associated ontology conveying the semantic meaning of the terms being used.5 They extended the relational algebra to query OERs. The advantage of their method is that the OER model can not only be built directly on top of commercial relational databases but can also be scaled to handle large data sets.
Ranganathan and Liu6 proposed a system to bridge the semantic gap between the queries given by users and the queries that can be answered by the database. They use domain knowledge contained in ontologies to extend relational databases with the ability to answer semantic queries expressed in SPARQL.7 Based on a semantic model of the data, end users express their queries in SPARQL and get back semantically relevant results. The experimental results show good performance on a sample relational database, using a combination of standard and custom ontologies.
Ontologies are becoming increasingly commonplace for semantically representing knowledge in a formal manner that facilitates sharing and integrating rich information for materials informatics. Moreover, ontologies can support logical reasoning by rule engines, which enhances automatic knowledge acquisition. Generally speaking, the purpose of materials informatics ontologies can be defined as three concrete objectives:8 (1) Translate data and information into knowledge that is useful not only to materials scientists but also to application engineers, regulators and other users. (2) Curate the knowledge base to align with emerging materials science research and industrial application development. (3) Present the knowledge in a flexible architecture that is understandable to each kind of user group. So far, quite a few materials ontologies have been defined, and the representative ones are as follows:
The Plinius ontology9 is the earliest materials ontology. It was developed for ceramic materials and covers the conceptualisation of the chemical composition of materials. The Plinius ontology is given as a conceptual construction kit, involving several sets of atomic concepts and construction rules for making complex concepts. It does not depend on a specific language, so it can be implemented in several languages or tools, such as Prolog, Ontolingua and LOOM.
Ashino et al.10 developed an information platform for data exchange between heterogeneous materials data resources, which has two components: a materials data portal service and ontology-based materials data exchange. The materials data portal service mainly aggregates information about materials databases through RDF site summary (RSS) technology, including the materials type, properties or other items that identify a materials database. The Ashino ontology covers quite a few fields in materials science, especially thermal properties. It contains more than 600 classes implemented in OWL (Web Ontology Language).
Cheung et al.11 developed a semantic web application, named MatSeek, that aims to integrate heterogeneous databases associated with materials science. MatSeek relies on a machine-processable OWL ontology (MatOnto12) to correlate processing parameters with nano-structure and physical and chemical properties, to help scientists discover potential new materials for specific and high-priority applications. MatSeek provides a federated search interface over several critical materials science databases, such as the Inorganic Crystal Structure Database (ICSD), the NIST Phase Equilibria Diagrams Database (PED), etc.
The ONTORULE steel ontology13 was developed by the European Union and focuses on coils, defects, phenomena, etc. It was developed for the steel industry and aims to build the conceptual model of the steel use case. The ONTORULE ontology is implemented in OWL.
FreeClassOWL14 was developed for the European construction and building materials market. It allows fine-grained description of, and search for, products, suppliers and warehouses for any building-related sourcing need. Based on FreeClassOWL, the Eurobau Utility ontology and the BauDataWeb RDF dataset, BauDataWeb15 has become one of the largest and richest public datasets for a well-defined vertical sector that is available on the Semantic Web.
Premkumar et al.16 developed a novel Semantic Laminated Composites Knowledge Management System (SLACKS) that reuses part of the structure of the Ashino ontology and the MatOnto ontology. The SLACKS ontology was developed for the engineering of laminated composite structures; it integrates relevant domains of the product life cycle, such as design, analysis, manufacturing and materials selection, through the engineering case study of a wind turbine blade. The SLACKS ontology yields a usable product life cycle knowledge tool that can facilitate efficient knowledge creation, retrieval and reuse from product design to manufacturing.
MatML is an extensible markup language (XML) developed especially to facilitate materials information exchange; it can uniformly represent materials property data to resolve syntactic and structural heterogeneity.17,18 Although MatML is simple, flexible and understandable, Ashino and Oka19 have shown that MatML is not adequate for data exchange between heterogeneous materials databases, and they proposed an ontology framework to define the structure of domain concepts. Zhang et al.20 proposed an approach to transform MatML-based materials data into an OWL ontology (named MatOWL). Using MatOWL, materials data can be explored in a more semantic way. Furthermore, MatOWL can be mapped to other ontologies with logic rules to provide more semantic context for domain experts, and more materials knowledge can be obtained by reasoning on the OWL ontology.
III. THE FRAMEWORK OF THE WHOLE SYSTEM
Suppose we have two heterogeneous databases. The idea of our method is to convert one basic materials database into a materials ontology and then integrate the other database into it. The data in each database are thereby integrated into the ontology, and end users can perform semantic queries on the integrated ontology. Figure 1 shows the whole procedure.
Figure 1. System architecture of the ontology-based integration of heterogeneous databases.
Although several materials ontologies exist, as mentioned in the related work, most of them represent one sort of material or a special field or application. At present, ontologies are often built manually, which is complicated and time-consuming and requires the participation of domain experts. We therefore adopt a semi-automatic method to build a materials ontology according to the structure of a comprehensive materials database. First, we extract the relational model from the database through the DBC API. Then we generate a materials ontology according to the relational model and a set of conversion rules. The tuples of the database can also be converted to ontology individuals according to the conversion rules. Next, we build an algebraic model for the materials ontology and the other materials database. From this model we obtain the relation between them, which can be used to convert the data of the database into individuals of the ontology. Through these steps, the heterogeneous databases are integrated together. Moreover, when relational algebra is used to integrate the ontology and the database, ontology individuals serve as the data carrier, which is better suited to SPARQL. We also adopt ontology rules to make the query results more accurate.
IV. MAPPING FROM RELATIONAL DATABASE TO ONTOLOGY
Currently, approaches to building an ontology from a relational database can be divided into three categories: manual, semi-automatic and automatic. Manual approaches, for instance,21 usually target a particular field. During manual ontology building, some hidden mapping relations can be found, but the work is time-consuming and difficult for ordinary researchers. Semi-automatic approaches, such as,22 are usually realized through interaction with domain experts. During ontology building, end users can participate in verifying and modifying the mapping results. Fully automatic methods are rarely used because of their low accuracy. Therefore, in this paper, we use the semi-automatic approach to build the ontology.
For ontology representation, the Web Ontology Language (OWL) is the latest standard recommended by the W3C.23 It is a vocabulary extension of the Resource Description Framework (RDF), and it facilitates greater machine interpretability of web content than XML, RDF and RDFS. In this paper we therefore choose OWL as the ontology description language.
A. Materials science tetrahedron for root concepts
Materials science mainly focuses on the study of structure, performance, processing and properties, which together are called the materials science tetrahedron,24 as shown in Figure 2.
In Figure 2, the structure of materials includes the bonding structure, the crystal structure and the organization structure. The bonding structure includes chemical bonds (ionic bonds, covalent bonds, metallic bonds) and physical bonds (hydrogen bonds, molecular bonds). The crystal structure of a material can be crystalline, non-crystalline or quasi-crystalline. The organization structure refers to the characteristics represented by the different components of a material. The properties of a material are its responses to electrical, magnetic, optical, thermal and mechanical loads; they include mechanical properties, physical properties, chemical properties, etc. Processing covers all unit operations, such as milling, blending and tableting, together with the relevant processing parameters. Performance is a set of characterization parameters of a material under certain conditions that describe the behavior or result of the material. It usually includes manufacturability, content uniformity, etc.
Therefore, to build the materials ontology, we create five owl:Class entities for material, structure, properties, processing and performance. Most of the owl:Class entities converted from the database will be subclasses of them.
B. Conversion from relational database to ontology
Our approach is to classify the relations and then formalize the corresponding conversion rules. Using the database commander (DBC), we obtain the relational model of the database, convert it according to the predefined conversion rules and output an OWL-based ontology. Figure 3 shows the whole conversion process.
Figure 3. Conversion flow chart from relational database to OWL-based ontology.
Relational Database. The relational database is a 6-tuple model: $R_d = \{U, D, DOM, F, PK, FK\}$, where $R_d$ is the name of the relation; $U$ is the set of attribute names of the relation; $D$ is the domain that the attributes in $U$ come from; $DOM$ is the set of mappings from attributes to domains, in which $DOM(U_i)$ specifies the type, range, length, etc. of $U_i$; $F$ is the set of data dependencies among the attributes; $PK(R_d)$ is the set of primary keys of $R_d$; and $FK(R_d)$ is the set of foreign keys of $R_d$.
Ontology. An ontology can be described by a number of sets of concepts, relations, lexical entries and links between these entities. The ontology is defined as a 5-tuple model:25 $O = \{C, H_c, R, rel, A_o\}$, where $O$ is the ontology name and $C$ is the set of concepts. $H_c$ is a taxonomy of concepts with multiple inheritance; for example, $H_c(C_1, C_2)$ denotes that $C_1$ is a subconcept of $C_2$. $R$ is a set of non-taxonomic relations described by their domain and range restrictions. $rel(R)$ describes a heterarchy of relations; for example, $rel(R) = (C_1, C_2)$ specifies that there is a relation $R$ between $C_1$ and $C_2$. $A_o$ is the set of axioms.
Keywords Set. $K$ is the set of keywords in materials science, including names related to materials performance, properties, structures and processing, for example, tensile strength, density, cold forming, porous, etc. The keywords set can be extended as needed.
Next, we classify the relations into different types and define, for each type, the rules for mapping the relational database to the ontology automatically.
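The three definitions above can be held in a few plain data structures. The following is a minimal sketch in Python rather than the authors' implementation: the class names, the field names and the choice of Grade as primary key are assumptions made for illustration, and the sets $D$, $F$ and $A_o$ are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Relation:
    """Relational schema R_d = {U, D, DOM, F, PK, FK}; D and F are omitted in this sketch."""
    name: str                          # R_d
    attributes: List[str]              # U
    dom: Dict[str, str]                # DOM: attribute -> datatype
    primary_keys: Set[str]             # PK(R_d)
    # FK(R_d): attribute -> name of the referenced relation
    foreign_keys: Dict[str, str] = field(default_factory=dict)


@dataclass
class Ontology:
    """Ontology O = {C, H_c, R, rel, A_o}; the axioms A_o are omitted in this sketch."""
    concepts: Set[str] = field(default_factory=set)                      # C
    taxonomy: Set[Tuple[str, str]] = field(default_factory=set)          # H_c(C1, C2)
    relations: Dict[str, Tuple[str, str]] = field(default_factory=dict)  # rel(R) = (C1, C2)


# Keyword set K, seeded with the examples given in the text.
KEYWORDS = {"tensile strength", "density", "cold forming", "porous"}

# Hypothetical schema for a Metallic relation; the primary key is an assumption.
metallic = Relation(
    name="Metallic",
    attributes=["Grade", "Name", "Formula", "Mass", "Tensile Strength"],
    dom={"Grade": "string", "Name": "string", "Formula": "string",
         "Mass": "float", "Tensile Strength": "float"},
    primary_keys={"Grade"},
)

# Attributes whose names are materials keywords (case-insensitive match).
keyword_attributes = [a for a in metallic.attributes if a.lower() in KEYWORDS]
```

Keeping the foreign keys as a mapping from attribute to referenced relation makes it straightforward to look up the target concept when the conversion rules are applied.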
Type (a): The database relation has primary keys but no foreign keys, that is, $PK(R_d) \neq \emptyset$ and $FK(R_d) = \emptyset$. This kind of relation is a basic entity. We have four rules for this kind of mapping.
Rule 1: Convert each database relation name to a concept, that is, $R_d \rightarrow C_i$. In the actual conversion process, $C_i$ is an owl:Class.
Rule 2: If a database relation name is a materials keyword, we set the corresponding concept taxonomy. That is, if $R_d \in K$, we obtain $H_c(C_i, C_{material})$ or $H_c(C_i, C_{properties})$ or $H_c(C_i, C_{performance})$ or $H_c(C_i, C_{structure})$ or $H_c(C_i, C_{processing})$. In the actual conversion process, we set $C_i$ rdfs:subClassOf $C_{properties}$ or $C_{material}$, etc. For example, if the database relation name is yield strength (a kind of mechanical performance), we set $C_i$ rdfs:subClassOf $C_{performance}$ to enrich the ontology.
Rule 3: Convert each attribute to a concept and set the corresponding non-taxonomic relation. That is, for each $U_j \in U$, we have $U_j \rightarrow C_j$ and $rel(R_j) = (C_i, C_j)$. In the actual conversion, we create an owl:DatatypeProperty for $U_j$, set its rdfs:domain to $C_i$ and set its rdfs:range according to $DOM(U_j)$.
Rule 4: If an attribute name is a materials keyword, we convert it to a concept and set the corresponding concept taxonomy. That is, if $U_j \in K$, we have $U_j \rightarrow C_j$ and $H_c(C_j, C_i)$. In the actual conversion, $C_j$ is an owl:Class. We set $C_j$ rdfs:subClassOf $C_i$ and, at the same time, $C_j$ rdfs:subClassOf $C_{properties}$ or $C_{material}$, etc., according to the keyword.
Figure 4 illustrates the conversion rules for type (a). When Rule 2 and Rule 4 are applied, the setting of subclasses should be supervised by domain experts. Figure 5 is an example for type (a). Metallic is the database relation name and a materials keyword, so we convert it to an owl:Class and set it as a subclass of Materials. Tensile Strength, Grade, Formula, Mass and Name are the attributes of Metallic materials. Tensile Strength is a materials keyword, so we convert it to an owl:Class and set it as a subclass of Metallic materials. Each of the remaining attributes is converted to an owl:DatatypeProperty whose domain is Metallic materials and whose range is set according to its DOM.
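Rules 1 to 4 can be sketched as a small function that emits subject-predicate-object triples. This is an illustrative sketch under stated assumptions, not the authors' converter: the function name `convert_type_a`, the `keyword_roots` mapping (which stands in for the keyword set $K$ plus the expert's choice of root class) and the assignment of Tensile Strength to Performance are all hypothetical. Property names follow the TableName-ColumnName convention used later in the experiments.

```python
def convert_type_a(name, attributes, keyword_roots):
    """Apply Rules 1-4 to a type (a) relation.

    attributes:    dict mapping attribute name -> XSD datatype (from DOM).
    keyword_roots: dict mapping a materials keyword -> root class chosen by an expert.
    Returns a list of (subject, predicate, object) triples.
    """
    triples = [(name, "rdf:type", "owl:Class")]                  # Rule 1
    if name in keyword_roots:                                    # Rule 2
        triples.append((name, "rdfs:subClassOf", keyword_roots[name]))
    for attr, xsd_type in attributes.items():
        if attr in keyword_roots:                                # Rule 4
            triples += [
                (attr, "rdf:type", "owl:Class"),
                (attr, "rdfs:subClassOf", name),
                (attr, "rdfs:subClassOf", keyword_roots[attr]),
            ]
        else:                                                    # Rule 3
            prop = f"{name}-{attr}"
            triples += [
                (prop, "rdf:type", "owl:DatatypeProperty"),
                (prop, "rdfs:domain", name),
                (prop, "rdfs:range", xsd_type),
            ]
    return triples


# Hypothetical input modelled on the Metallic example (Figure 5).
triples = convert_type_a(
    "Metallic",
    {"Grade": "xsd:string", "Tensile Strength": "xsd:float"},
    {"Metallic": "Materials", "Tensile Strength": "Performance"},
)
for t in triples:
    print(t)
```

Returning plain triples keeps the sketch independent of any RDF library; the same triples could be handed to a serializer of choice.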
Type (b): The database relation has only one primary key and one foreign key, and the primary key is the same as the foreign key. That is, $|PK(R_d)| = 1$, $|FK(R_d)| = 1$ and $PK(R_d) = FK(R_d)$. This kind of relation is usually caused by inheritance among entities. In this case, we use the following rule for mapping.
Rule 5: Convert the database relation to a concept and set the corresponding concept taxonomy according to its foreign key. That is, $R_d \rightarrow C_i$ and we set $H_c(C_i, C_j)$, where $C_j$ is the concept of the relation referenced by the foreign key. In the actual conversion process, $C_i$ is an owl:Class and we set $C_i$ rdfs:subClassOf $C_j$.
Figure 6 shows the conversion rules for type (b) and Figure 7 is a mapping example for type (b). The relation has one foreign key. High Strength Steel is the database relation name and Grade is the foreign key, which refers to Metallic materials. We convert High Strength Steel to an owl:Class and set it as a subclass of Metallic materials.
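As a small illustration of Rule 5, the sketch below renders the resulting class and its subclass axiom as a Turtle fragment. The function name, the `mat` prefix and the replacement of spaces by underscores are assumptions made for this example; the prefix declarations for `mat`, `owl` and `rdfs` are expected to be supplied by the surrounding document.

```python
def type_b_to_turtle(relation_name, parent_name, prefix="mat"):
    """Rule 5: the relation becomes an owl:Class that is a subclass of the
    class referenced by its foreign key. Returns a Turtle fragment."""
    child = relation_name.strip().replace(" ", "_")
    parent = parent_name.strip().replace(" ", "_")
    return (f"{prefix}:{child} a owl:Class ;\n"
            f"    rdfs:subClassOf {prefix}:{parent} .")


# High Strength Steel inherits from Metallic, as in the Figure 7 example.
print(type_b_to_turtle("High Strength Steel", "Metallic"))
```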
Type (c): The database relation has two attributes, two primary keys and two foreign keys, and each foreign key is a primary key of another relation. That is, $|U| = 2$, $|PK(R_d)| = 2$, $|FK(R_d)| = 2$, and for $fk_1, fk_2 \in FK(R_d)$ we have $fk_1 \in PK(R_j)$ and $fk_2 \in PK(R_k)$. This kind of relation is usually caused by a many-to-many relationship among entities. In this case, we use the following rule for mapping.
Rule 6: According to the concepts corresponding to the foreign keys, create two relations to specify the many-to-many relationship between the two concepts. That is, $rel(R_1) = (C_j, C_k)$ and $rel(R_2) = (C_k, C_j)$. In the actual conversion process, we convert $R_1$ and $R_2$ to owl:ObjectProperty entities, with rdfs:domain $= C_j$ and rdfs:range $= C_k$ for $R_1$, and rdfs:domain $= C_k$ and rdfs:range $= C_j$ for $R_2$. We also set $R_1$ owl:inverseOf $R_2$.
Figure 8 shows the conversion rules for type (c) and Figure 9 is a mapping example for type (c). Structures_id is the foreign key related to Structures and Element_id is the foreign key related to Element. We need to build the relation between Structures and Element, so we convert the two attributes to two owl:ObjectProperty entities. The domain of the owl:ObjectProperty Structures-Element is set to Structures and its range to Element. Element-Structures is handled in the same way, with domain and range exchanged.
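Rule 6 produces a pair of mutually inverse object properties. A minimal sketch, assuming the hypothetical function name `convert_type_c` and property names of the form Concept-Concept as in the Structures-Element example:

```python
def convert_type_c(concept_j, concept_k):
    """Rule 6: a many-to-many link table becomes two inverse object properties."""
    r1 = f"{concept_j}-{concept_k}"   # rel(R1) = (Cj, Ck)
    r2 = f"{concept_k}-{concept_j}"   # rel(R2) = (Ck, Cj)
    return [
        (r1, "rdf:type", "owl:ObjectProperty"),
        (r1, "rdfs:domain", concept_j),
        (r1, "rdfs:range", concept_k),
        (r2, "rdf:type", "owl:ObjectProperty"),
        (r2, "rdfs:domain", concept_k),
        (r2, "rdfs:range", concept_j),
        (r1, "owl:inverseOf", r2),
    ]


triples = convert_type_c("Structures", "Element")
for t in triples:
    print(t)
```

Declaring the inverse once is enough for a reasoner to infer the reverse direction of every asserted link.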
Type (d): The database relation has a non-empty primary key and only one foreign key. That is, $PK(R_d) \neq \emptyset$ and $|FK(R_d)| = 1$. This kind of relation is usually caused by a one-to-one or one-to-many relationship between two entities.
Type (e): The database relation has a non-empty primary key and more than two foreign keys. That is, $PK(R_d) \neq \emptyset$ and $|FK(R_d)| > 2$. This kind of relation is usually caused by multiple relations among entities.
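The five relation types can be told apart from the attribute, primary key and foreign key sets alone. The sketch below is one possible reading, not the authors' code: the function name is hypothetical, and it assumes that any relation with two or more foreign keys that does not match type (c) is handled as type (e), since types (d) and (e) share the same conversion rules.

```python
def classify_relation(attributes, primary_keys, foreign_keys):
    """Return the relation type 'a'..'e' used to select the conversion rules.

    Assumption: a relation with two or more foreign keys that is not a pure
    link table (type c) is treated as type (e).
    """
    attrs, pk, fk = set(attributes), set(primary_keys), set(foreign_keys)
    if not pk:
        raise ValueError("the conversion rules assume a non-empty primary key")
    if not fk:
        return "a"                       # basic entity
    if len(pk) == 1 and pk == fk:
        return "b"                       # inheritance
    if len(attrs) == 2 and attrs == pk == fk:
        return "c"                       # many-to-many link table
    return "d" if len(fk) == 1 else "e"  # one or several foreign keys


# Hypothetical schemas modelled on the examples in the text.
print(classify_relation(["Grade", "Name"], {"Grade"}, set()))
print(classify_relation(["Structures_id", "Element_id"],
                        {"Structures_id", "Element_id"},
                        {"Structures_id", "Element_id"}))
```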
A database relation of type (d) or (e) should be converted to an owl:Class, so Rules 1-4 of type (a) can be used. In addition, Rule 7 can be used to specify the foreign key relation, and cardinality restrictions in OWL can specify the one-to-one and one-to-many relationships. Therefore we use the following rules:
Rule 7: According to each concept corresponding to the foreign keys, create a relation to specify the relationship between the two concepts. That is, for each $fk_m \in FK(R_d)$ there exists a relation $R_j$ with $fk_m \in PK(R_j)$, and we have $fk_m \rightarrow rel(R_m) = (C_i, C_j)$. In the actual conversion process, we convert $R_m$ to an owl:ObjectProperty with rdfs:domain $= C_i$ and rdfs:range $= C_j$. If $fk_m$ cannot be empty, we set owl:minCardinality = 1; otherwise we set owl:minCardinality = 0.
Figure 10 shows the conversion rules for types (d) and (e). Figure 11 is an example for types (d) and (e), in which Materials has two foreign keys, Performance_id and Structures_id. We convert both of them to owl:ObjectProperty entities whose domain is Materials and whose ranges are Performance and Structures, respectively.
In addition, there are cardinality restrictions that depend on whether an attribute may be empty. We use Rule 8 and Rule 9 for this conversion.
Rule 8: If the attribute cannot be empty, we set its cardinality restriction. That is, if $U_j \in U$ and $U_j \neq$ null, we have $rel(R_j) = (C_i, C_j)$. In the actual conversion process, we convert $R_j$ to an owl:DatatypeProperty and set its owl:cardinality to 1.
Rule 9: If the attribute can be empty, we set its cardinality restriction. That is, if $U_j \in U$ and $U_j$ may be null, we have $rel(R_j) = (C_i, C_j)$. In the actual conversion process, we convert $R_j$ to an owl:DatatypeProperty and set its owl:maxCardinality to 1.
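Rules 7 to 9 differ only in the kind of property and the cardinality attached to it. The following sketch records them as flat triples; the function names are hypothetical, and attaching the cardinality directly to the property is a simplification, since OWL expresses cardinality through owl:Restriction classes.

```python
def foreign_key_property(concept, fk_name, target_concept, nullable):
    """Rule 7: a foreign key becomes an object property from concept to target."""
    prop = f"{concept}-{fk_name}"
    return [
        (prop, "rdf:type", "owl:ObjectProperty"),
        (prop, "rdfs:domain", concept),
        (prop, "rdfs:range", target_concept),
        # Mandatory foreign key -> at least one value; optional -> zero or more.
        (prop, "owl:minCardinality", 0 if nullable else 1),
    ]


def attribute_property(concept, attribute, xsd_type, nullable):
    """Rules 8 and 9: an ordinary attribute becomes a datatype property."""
    prop = f"{concept}-{attribute}"
    # Rule 8 (NOT NULL): exactly one value. Rule 9 (nullable): at most one value.
    restriction = ("owl:maxCardinality", 1) if nullable else ("owl:cardinality", 1)
    return [
        (prop, "rdf:type", "owl:DatatypeProperty"),
        (prop, "rdfs:domain", concept),
        (prop, "rdfs:range", xsd_type),
        (prop,) + restriction,
    ]


# Hypothetical columns of the Materials relation from the Figure 11 example.
for t in foreign_key_property("Materials", "Structures_id", "Structures", nullable=False):
    print(t)
for t in attribute_property("Materials", "name", "xsd:string", nullable=True):
    print(t)
```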
C. The data conversion
In order to integrate heterogeneous databases, we should convert the data of the database to individuals of the ontology. After the schema has been converted according to the rules above, the data can be converted easily. Suppose that $I(C_i)$ is an individual of $C_i$ and $T(R_d)$ is the set of data tuples of $R_d$.
Rule 10: Convert each tuple in the database to an individual and give it a unique identifier. That is, for each $t \in T(R_d)$, we have $t \rightarrow I(C_i)$. In the actual conversion process, we use the database relation name and the primary key as the unique identifier. We convert the data in the tuple to statements connected to the corresponding owl:DatatypeProperty, and we convert the foreign keys to owl:ObjectProperty statements.
So far, using the above conversion rules, we have obtained a materials ontology in which all the data of one materials database are stored. Next, we consider how to integrate other databases into the created ontology easily.
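Rule 10 can be illustrated with a function that turns one row into an individual and its property statements. This is a sketch under assumptions: the identifier is built from the relation name and the primary key value, empty values are skipped, and the table, columns and values in the example are hypothetical.

```python
def tuple_to_individual(relation, pk_name, row, foreign_keys):
    """Rule 10: convert one database tuple into an individual.

    row:          dict mapping column name -> value.
    foreign_keys: dict mapping foreign key column -> referenced relation name.
    """
    individual = f"{relation}-{row[pk_name]}"       # unique identifier
    triples = [(individual, "rdf:type", relation)]
    for column, value in row.items():
        if value is None:                           # empty cells yield no statement
            continue
        prop = f"{relation}-{column}"
        if column in foreign_keys:
            # Object property: link to the individual of the referenced tuple.
            triples.append((individual, prop, f"{foreign_keys[column]}-{value}"))
        else:
            # Datatype property: keep the literal value.
            triples.append((individual, prop, value))
    return triples


row = {"id": 7, "volume": 27.88, "Spacegroups_id": 221, "label": None}
for t in tuple_to_individual("Structures", "id", row, {"Spacegroups_id": "Spacegroups"}):
    print(t)
```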
V. INTEGRATE OTHER DATABASES TO ONTOLOGY
In order to integrate other databases into the ontology, we build a mathematical structure for the materials ontology and the other materials database.
A. Define the ontology as a mathematical structure
In an ontology, $C$ is the set of concepts and "$\times$" is a binary operation on $C$ that represents the combination of two concepts. The order in which two concepts are combined does not matter, so the concepts $c = c_1 \times c_2$ and $c' = c_2 \times c_1$ are the same; that is, "$\times$" is commutative. In addition, "$\times$" should be idempotent.
So, let $\langle C, \times, e_c \rangle$ be a commutative idempotent monoid of concepts, where $e_c$ is a pseudo-concept that is neutral with respect to the concepts. That is, for every $C_i \in C$ we have $C_i \times e_c = e_c \times C_i = C_i$.
The combination operator "$\times$" also satisfies the associative law. It is worth mentioning, however, that combining concepts in $C$ may in some cases produce concepts with no real meaning; those concepts exist only to satisfy the closure property of the monoid.
Part-of-relation. For every $c_1, c_2 \in C$, if $c_1$ is part-of $c_2$, we write $c_1 \sqsubseteq_c c_2 \iff (\exists c \mid c \in C : c_1 \times c = c_2)$.
The part-of-relation is a partial order; it satisfies the three axioms of posets. Since the mereological relationship is a simplification of a partial order,26 we can use the poset properties to build the structure of the ontology and link it to the materials database.
Starting from the concept of the main domain, we define $L$ as the subset of $C$ that contains the main concept of $C$ and all its parts down to the atoms. The part-of-relation forms a Boolean lattice of concepts,27 that is, $\mathcal{L} = (L, \sqsubseteq_c)$. The pseudo-concept $e_c$ is also included in $L$, so any two concepts share at least the concept $e_c$ even if they are structurally unrelated.
Some concepts in $L$ may be associated with other concepts in $C$ by one or more relationships. In that case, we call the concept in $L$ the ancestor element of the relationship. We consider that only subsets of concepts can be connected to the lattice. The relationships can be represented as rooted graphs, and the root of each rooted graph is always part of the lattice.
$\mathcal{G} = \{G_i\}$ is a family of rooted graphs. With each $G_i$ we associate a relation $R_i$, which should be connected to the top element. For $a \in C$, $R_a$ is defined as $\{(x, x) \mid x = a\}$.
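One concrete way to see the monoid and the part-of-relation is to model each concept as a set of atomic concepts, with set union as the combination "×" and the empty set as the pseudo-concept. This is only an illustrative model chosen for this sketch, not the construction used in the paper, and the atom names are hypothetical. Under this model the part-of-relation coincides with set inclusion, and all combinations of the atoms form a Boolean lattice.

```python
from itertools import combinations

ATOMS = ("element", "spacegroup", "cell")   # hypothetical atomic concepts
E_C = frozenset()                           # pseudo-concept e_c (neutral element)


def combine(c1, c2):
    """The combination operator: commutative, associative and idempotent."""
    return c1 | c2


def part_of(c1, c2, universe):
    """c1 is part-of c2 iff some concept c in the universe gives c1 x c = c2."""
    return any(combine(c1, c) == c2 for c in universe)


# The carrier L: every combination of atoms (the power set), including e_c.
L = [frozenset(s) for r in range(len(ATOMS) + 1) for s in combinations(ATOMS, r)]

element = frozenset({"element"})
structure = frozenset(ATOMS)
print(part_of(element, structure, L))   # element is a part of the full combination
print(part_of(structure, element, L))   # but not the other way round
```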
Rooted Graph. Let $C_i \subseteq C$ and let $R_i$ be a relation on $C_i$. Rooted graph , where "*" is the power operation. As in Ref. 28, we denote a rooted graph as $G_i = (C_i, R_i, t_i)$, where $t_i$ is the root of $G_i$.
Thus, we can define the ontology as a mathematical structure:
Mathematical Structure of Ontology. Suppose $C$ is the set of concepts, $\mathcal{L} = (L, \sqsubseteq_c)$ is a Boolean lattice and $\mathcal{G} = \{G_i\}$ is a family of rooted graphs. Then the mathematical structure of the ontology is $\mathcal{O} = (C, \mathcal{L}, \mathcal{G})$.
The definition specifies that each relation has an ancestor element, which is the root of the rooted graph. Each root in $\mathcal{G}$ ends up in $\mathcal{L}$. The definition also ensures that all the concepts in a relation can connect to the lattice.
B. Relational database structure
To model the materials database, we can use the relational algebra $\mathcal{A} = (A, +, \cdot, -)$, where $A$ is the set of relations, "$+$" and "$\cdot$" are binary operations on $A$ denoting the intersection and the union of relations, and "$-$" is a unary operation on $A$ denoting the complement of a relation.
Consider the set of attributes $\{1, 2, \ldots, n\}$. A set $J$ of attributes is called a type. For each $i \in J$, there is a non-empty attribute domain $D(A_i)$ for $A_i$. A relation of type $J$ is a set of tuples $A$, and each element $A_i \in A$ is called a tuple. For a tuple $A_i \in A$, we have $\tau(A_i) = \tau(A) = J$, where $\tau$ is an operator that returns the type of $A$. In a relational database, $J$ is represented as a table whose columns represent the attributes in $J$, and the tuples are represented as the rows of the table.
Herein, we use the relational algebra $\mathcal{A} = (A, +, \cdot, -)$ to describe the other materials database, and then connect the two structures together. To do so, we define the operator $\tau' : A \rightarrow C$ to connect each entity to the corresponding concept in the rooted graph. For a rooted graph $G_i = (C_i, R_i, t_i)$, we have an operator $\rho_i : C_i \rightarrow L$ with $\rho_i(c) = t_i$, where $c \in C_i$. Finally, we define the type operator $\tau : A \rightarrow L$ as $\tau = \rho \circ \tau'$.
Type of Entity Combination. The type of a combination of entities is the combination of the types of the entities, that is, $\tau(a \cdot b) = \tau(a) \times \tau(b)$.
Mathematical Structure of Heterogeneous Materials Databases. Suppose $\mathcal{O} = (C, \mathcal{L}, \mathcal{G})$ is the structure of the ontology, $\mathcal{A} = (A, +, \cdot, -)$ is the relational algebra and $\tau = \rho \circ \tau'$ is the type operator. Then the mathematical structure of heterogeneous materials databases is $\mathcal{D} = (\mathcal{O}, \mathcal{A}, \tau)$.
The integration procedure of two heterogeneous databases is given in Algorithm 1.
Algorithm 1. The integration procedure of two heterogeneous databases.
1: Take all main concepts and their parts and subparts, down to the atoms, from $C$ to build the Boolean lattice $\mathcal{L}$ with the part-of-relation.
2: Suppose there exists a set of concepts $V \subseteq (C - L)$, where each concept $V_i \in V$ has $rel(R_i) = (L_j, V_i)$ and $L_j \in L$. For every $L_j \in L$, use $L_j$ and the corresponding $V_i$ to build the rooted graph.
3: Use the relational algebra $\mathcal{A} = (A, +, \cdot, -)$ to describe the structure of the materials database.
4: For each entity $E_1, E_2, \ldots, E_i \in E$ ($E$ is also an entity), set $\tau'(E_1) = C_1, \tau'(E_2) = C_2, \ldots, \tau'(E_i) = C_i$ to associate each entity with the concept it instantiates.
5: The type of each entity $E_i$ is $\tau(E_i) = \rho(\tau'(E_i)) = \rho(C_i)$. The type of entity $E$ is then $\tau(E) = \tau(E_1 \cdot E_2 \cdot \cdots \cdot E_i) = \tau(E_1) \times \tau(E_2) \times \cdots \times \tau(E_i)$.
6: Convert the data from the database to individuals according to their types.
VI. EXPERIMENTS
We use OQMD29,30 and Materials Project,31 two well-known first-principles computational databases, as the experimental databases. The E-R models of the two databases are shown in Figures 12 and 13. Using the mapping rules described in Section IV, we convert the OQMD database to an ontology. To avoid duplicate identifiers, we use TableName-ColumnName to name each owl:DatatypeProperty. For example, the table Elements belongs to type (e), so we build an owl:Class Elements. All the attributes of the table Elements are converted to owl:DatatypeProperty entities whose domain is Elements. Then we convert all the foreign keys to owl:ObjectProperty entities whose domain is Elements and whose ranges are Atoms, Compositions and Structures. After tuning, we obtain the converted ontology shown in Figure 14.
Then we build the mathematical model for the ontology and the other materials database, Materials Project, as described in Section V. Figure 15 shows the main part of the integrated model. Concepts such as materials, structure, elements, spaceGroup and $e_c$ form the lattice of concepts. The concepts hall, latticesystem, symbol, point_group, etc. are part of the rooted graph, and (-P 4 2 3), cubic, P63/mmc, etc. are entities.
Through the matching of synonyms and homoionyms and some manual correction, we can map the data from the database to the concepts of the ontology. In practice, however, one database rarely covers another completely, so new concepts should be created if no corresponding concept can be found for the data.
For example, a tuple a(12.03868668, 0, 27.88188999, Ru, -P 4 2 3, cubic, P63/mmc, 4/mmm), we get the type for each attribute, such as τ′(−P 4 2 3) = hall. And in the rooted graph, we have ρ(hall) = spacegroup. Thus we can get the type of (− P 4 2 3) as Equation (1), and all the types of the rest attributes can be obtained in the similar way.
Through definition 8, we get τ(a) = τ(12.03868668) × τ(27.88188999) ×⋯× τ(4/mmm). And we get τ(a) = structure. Then we can convert the data from the database to the individuals according to the type of tuple a. Thus the data of heterogeneous materials database are integrated together.
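The typing step for the spacegroup attributes of tuple a can be sketched in Python as follows. Only τ′(−P 4 2 3) = hall and ρ(hall) = spacegroup are stated in the text; the other dictionary entries are assumptions made for illustration, and the final step that identifies the product type with the concept structure (Definition 8) is not modelled here.

```python
# tau': entity -> concept it instantiates.
# Only the "-P 4 2 3" entry is given in the text; the rest are assumed.
TAU_PRIME = {
    "-P 4 2 3": "hall",
    "cubic": "latticesystem",
    "P63/mmc": "symbol",
    "4/mmm": "point_group",
}

# rho: concept in the rooted graph -> concept in the lattice of concepts.
# Only rho(hall) = spacegroup is given in the text; the rest are assumed.
RHO = {
    "hall": "spacegroup",
    "latticesystem": "spacegroup",
    "symbol": "spacegroup",
    "point_group": "spacegroup",
}


def tau(entity):
    """Type of a single entity: tau(E_i) = rho(tau'(E_i))."""
    return RHO[TAU_PRIME[entity]]


def tuple_type(entities):
    """Type of a composite entity as the product of its component types.

    The Cartesian product is represented as a Python tuple of type names.
    """
    return tuple(tau(entity) for entity in entities)


hall_type = tau("-P 4 2 3")
partial_type = tuple_type(["-P 4 2 3", "cubic", "P63/mmc", "4/mmm"])
```

Here `hall_type` reproduces Equation (1), and `partial_type` is the product of the component types for the last four attributes of tuple a under the assumed mappings.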
Once the integrated ontology is created, semantic queries can be run on the ontology, which integrates the two materials databases, OQMD and Materials Project. We can construct SPARQL queries to extract information from the integrated ontology. For example, to query all the structures whose latticesystem equals cubic, we construct the semantic query shown in SPARQL Query Example 1.
SPARQL Query Example 1.

    PREFIX this: <http://shu.edu.cn/material/ontology#>
    SELECT ?structure ?volume ?spacegroups WHERE {
      # Follow structure -> spacegroup -> lattice system.
      ?structure this:structures-spacegroups ?spacegroups .
      ?spacegroups this:spacegroups-lattice_system ?lattice .
      ?structure this:structures-volume ?volume .
      # Case-insensitive match on the lattice system.
      FILTER regex(?lattice, 'Cubic', 'i')
    }
Part of the results is shown in Table I. In the implementation, the Jena API32 is used to execute the SPARQL query. All results are returned in the ResultSet format, which is easy to process. Data from both databases are retrieved together, without constructing a separate SQL statement for each relational database. It is also worth mentioning that the query is executed on the ontology rather than on the relational databases, so complicated and time-consuming union or join operations are avoided when the query involves multiple tables.
Table I. Part of the semantic query results from the integrated ontology.
Structure | Volume | Spacegroups | Datasource |
---|---|---|---|
structures-pri-243 | 25.77007415 | spacegroups-pri-243 | Material Project |
structures-pri-267 | 265.8010647 | spacegroups-pri-267 | Material Project |
structures-34112 | 10.8604 | spacegroups-229 | OQMD |
structures-34224 | 20.8806 | spacegroups-216 | OQMD |
structures-pri-496 | 367.4211175 | spacegroups-pri-496 | Material Project |
structures-34611 | 77.4177 | spacegroups-221 | OQMD |
structures-pri-167 | 69.53995864 | spacegroups-pri-167 | Material Project |
structures-34448 | 33.1271 | spacegroups-229 | OQMD |
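To make concrete why one query can serve both sources, the following Python sketch evaluates the triple patterns of SPARQL Query Example 1 over a small in-memory set of triples. It is a toy matcher, not Jena's query engine. The first two structures and their volumes are taken from Table I; the lattice-system literals and the third, non-cubic structure are hypothetical.

```python
import re

# Toy triples merged from both sources. The lattice-system literals and
# the "demo" structure are hypothetical and serve only to exercise the filter.
TRIPLES = [
    ("structures-34112", "structures-spacegroups", "spacegroups-229"),
    ("structures-34112", "structures-volume", "10.8604"),
    ("spacegroups-229", "spacegroups-lattice_system", "cubic"),
    ("structures-pri-243", "structures-spacegroups", "spacegroups-pri-243"),
    ("structures-pri-243", "structures-volume", "25.77007415"),
    ("spacegroups-pri-243", "spacegroups-lattice_system", "Cubic"),
    ("structures-demo", "structures-spacegroups", "spacegroups-demo"),
    ("structures-demo", "structures-volume", "1.0"),
    ("spacegroups-demo", "spacegroups-lattice_system", "tetragonal"),
]


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(triples, patterns):
    """Evaluate a conjunction of triple patterns by successive matching."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


PATTERNS = [
    ("?structure", "structures-spacegroups", "?spacegroups"),
    ("?spacegroups", "spacegroups-lattice_system", "?lattice"),
    ("?structure", "structures-volume", "?volume"),
]

# Equivalent of FILTER regex(?lattice, 'Cubic', 'i').
rows = [
    row for row in query(TRIPLES, PATTERNS)
    if re.search("Cubic", row["?lattice"], re.IGNORECASE)
]
```

Because the triples of both sources share the same property names, the same three patterns return the OQMD structure and the Materials Project structure together, while the hypothetical tetragonal structure is filtered out.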
Furthermore, we can create additional rules for the materials ontology to refine query results. For example, when one relational database records a relation that the other does not, we can create rules that add an external relation for the unlinked data. Suppose we want to extract all the structure individuals that are related to the element “C”. We may construct the query shown in SPARQL Query Example 2.
SPARQL Query Example 2.

    PREFIX this: <http://shu.edu.cn/material/ontology#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?structure ?volume ?spacegroups WHERE {
      # Follow structure -> element -> symbol.
      ?structure this:structures-elements ?element .
      ?element this:elements-symbol ?symbol .
      ?structure this:structures-volume ?volume .
      ?structure this:structures-spacegroups ?spacegroups .
      # Keep only elements whose symbol ends with 'C'.
      FILTER regex(?symbol, 'C$')
    }
The individuals can be queried from both materials databases. However, as shown in Figure 16, the relationship between structures and elements exists only in the OQMD database: the structure data and element data in OQMD are linked, whereas the structure data in Materials Project have no connection to the element data. When SPARQL Query Example 2 is executed, results can therefore be returned only from the OQMD database. Table II shows the results before adding rules; only data from the OQMD database are retrieved. To get a better result, we can add an external rule such as Ontology rule example 1.
Ontology rule example 1.

    [rule: (?structure this:structures-composition ?composition)
           (?element this:elements-symbol ?symbol)
           regex(?composition, ?symbol)
           -> (?structure this:structures-elements ?element)].
Table II. Query results before adding rules.
Structure | Volume | Spacegroups | Datasource |
---|---|---|---|
structures-34224 | 20.8806 | spacegroups-216 | OQMD |
structures-34170 | 16.624 | spacegroups-225 | OQMD |
The rule means that if the composition of a structure contains a certain element, we connect the structure and the corresponding element through an owl:ObjectProperty. Thus, when an end user searches for structures containing certain elements, structures in Materials Project can also be returned, which was not possible before. Table III shows the search results after adding the rule. From Table III, we can see that two results come from OQMD and five from Materials Project.
Table III. Query results after adding rules.
Structure | Volume | Spacegroups | Datasource |
---|---|---|---|
structures-34224 | 20.8806 | spacegroups-216 | OQMD |
structures-34170 | 16.624 | spacegroups-225 | OQMD |
structures-pri-42 | 41.13742744 | spacegroups-pri-42 | Material Project |
structures-pri-21 | 44.91792373 | spacegroups-pri-21 | Material Project |
structures-pri-55 | 11.41878254 | spacegroups-pri-55 | Material Project |
structures-pri-149 | 21.21334856 | spacegroups-pri-149 | Material Project |
structures-pri-41 | 22.87020916 | spacegroups-pri-41 | Material Project |
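The effect of Ontology rule example 1 can be sketched in Python as a single forward-inference pass over a set of triples. This is an illustration under stated assumptions rather than the Jena rule engine: all identifiers and compositions in `FACTS` are hypothetical, compositions are assumed to be plain formula strings such as "SiC", and the sketch splits the formula into element symbols instead of applying the rule's regex call directly.

```python
import re


def elements_in(composition):
    """Split a formula such as 'SiC' into its element symbols.

    Assumes each symbol is one capital letter plus an optional lowercase letter.
    """
    return set(re.findall(r"[A-Z][a-z]?", composition))


def apply_rule(triples):
    """Link each structure to every element named in its composition."""
    symbol_to_element = {
        obj: subj for subj, pred, obj in triples if pred == "elements-symbol"
    }
    inferred = set(triples)
    for subj, pred, obj in triples:
        if pred != "structures-composition":
            continue
        for symbol in elements_in(obj):
            if symbol in symbol_to_element:
                inferred.add(
                    (subj, "structures-elements", symbol_to_element[symbol])
                )
    return inferred


# Hypothetical facts: two structures without element links and one element.
FACTS = {
    ("structures-demo-1", "structures-composition", "SiC"),
    ("structures-demo-2", "structures-composition", "CaO"),
    ("elements-C", "elements-symbol", "C"),
}
inferred = apply_rule(FACTS)
```

Splitting the formula into symbols is a design choice of this sketch: a plain substring test for "C" would also link the hypothetical CaO structure to carbon, whereas here only the SiC structure gains a `structures-elements` link.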
Ontology creation takes only milliseconds, which is negligible. Most of the time in the method is spent converting individuals and the relationships between them. Figures 17 and 18 show that when the amount of data is in the millions, the conversion of individuals and relationships takes several minutes, and a single query takes about half a second.
Currently, we have implemented a prototype system deployed at https://matdata.shu.edu.cn. The interface for defining rules and running semantic queries is shown in Figure 19.
Figure 19. A prototype system for ontology-based data integration and semantic query.
VII. CONCLUSIONS AND FUTURE WORK
With the fast development of materials science, materials big data, especially data from high throughput experiments and computations, is growing rapidly. However, different databases have their own schemas and structures, which brings great challenges for data sharing and integration. The main contributions of this paper are as follows:
First, we present a set of conversion rules to transform a relational materials database into an ontology. The rules are general and can be used in other areas.
Second, we build a mathematical model for the materials ontology and the heterogeneous materials database, which allows the data in the database to be mapped to the individuals of the ontology, so that the heterogeneous databases are integrated.
For future work, several directions can be pursued. Firstly, during ontology creation, manual intervention should be reduced as much as possible without affecting accuracy. Secondly, the data and the ontology should be separated physically to further improve query performance. Finally, SPARQL construction should be visualized so that non-professional users can build queries easily and conveniently.
ACKNOWLEDGMENTS
This work is partially sponsored by the National Key Research and Development Program of China (2016YFB0700504, 2017YFB0701601), the Shanghai Municipal Science and Technology Commission (15DZ2260301), and the Natural Science Foundation of Shanghai (16ZR1411200). The authors thank the anonymous reviewers for their valuable comments.