Semantic Software Architecture
Semantic software architecture enables a federated information architecture where individual groups operate their own information systems, while the information in those systems can be accessed and analyzed by any other properly authorized group. It is based on open web standards from the IETF and W3C, including HTTP, URI, URL, XML, XSD, RDF, RDFS, SPARQL, RIF, and OWL. Since HTTP operates on any TCP/IP based network, this solution will run on most enterprise networks. Any of the components can be exposed as web services to fit seamlessly within an SOA.
Semantic software architecture enables a very different approach to reporting and analysis. It does not depend on building data warehouses; instead, it relies on the natural federation and integration capabilities enabled by the W3C standards. All information is exposed as RDF and is usually accessed through SPARQL endpoints.
The software architecture is model driven: RDF and/or OWL models drive the behavior of the various semantic software components. The software architecture necessarily starts with OWL/RDF editing tools like Protege, TopBraid Composer, Knoodl.com, and SemanticWorks. Using any of these tools, models of business domains, databases, mappings, business processes, services, and spreadsheets, among many other things, can be built and used by any semantic software component. Of these types of models, the "business domain ontology" is the most important and serves to establish the semantic baseline for all models used in each solution.
Since the software is model driven by ontologies, it is critical to design and implement an ontology architecture. Ontology architecture is a new aspect of system architecture and development; to our knowledge, it has not been employed anywhere else in DOD. In a semantic software architecture paradigm, ontologies can be thought of like software code: the models are used in place of code and enable a degree of flexibility and extensibility that is not available in systems that are not model driven, where all of the functionality of the solution is "hard-coded". Since any well engineered information system would begin with a software architecture, and ontologies have the same function as code, it is necessary to design and implement an ontology architecture that complements and enables the system architecture. These models are built by OWL modelers, not programmers, enabling many more people to engage in creating the functionality of the solution.
Other software components in a semantic architecture include RDF triple stores, Spyders, Federators, query designers, and viewers/dashboards. All of these capabilities are delivered using standard web technologies.
The function of the Spyder in the software architecture is to enable owners of any sort of information system to expose the contents of the system as one or more RDF graphs. It is possible to access the information as RDF from each system using an "HTTP GET" operation, but normally the RDF is obtained by a SPARQL query using the SPARQL endpoint enabled by a Spyder. A Spyder will introspect each data source and expose a number of metrics about the source to enable efficient query planning and execution. Two kinds of ontologies are used in the operation of the Spyder: the source ontology and the mapping ontology. The source ontology informs the creators of the mapping ontology about the schema of the source. The mapping ontology enables the transformation of data source elements into domain ontology elements.
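As a minimal sketch of how a client might address a Spyder's SPARQL endpoint, the following Python builds (but does not send) an HTTP GET request that carries a query in the "query" parameter, as the SPARQL Protocol specifies. The endpoint URL, the function name, and the choice of RDF/XML as the requested format are assumptions made for illustration; they are not taken from any actual Spyder implementation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Hypothetical endpoint URL for a Spyder (an assumption, not from the source).
SPYDER_ENDPOINT = "http://spyder.example.org/sparql"


def build_sparql_get(endpoint: str, query: str) -> Request:
    """Build, without sending, an HTTP GET request carrying a SPARQL query."""
    # The SPARQL Protocol passes the query text in the "query" URL parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # Request RDF/XML, one standard RDF serialization, for CONSTRUCT results.
    return Request(url, headers={"Accept": "application/rdf+xml"}, method="GET")


# A CONSTRUCT query returns an RDF graph rather than a table of bindings.
query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10"
request = build_sparql_get(SPYDER_ENDPOINT, query)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would execute it against a live endpoint; that step is left out here because no real endpoint is assumed.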
The Mapping Ontology
The mapping ontology enables OWL modelers to describe how data stored in a database should be translated to RDF. The translation can be described based on an existing ontology or a data-source specific ontology. Our approach depends on these descriptions being based on the Business Domain Ontology. This translation can then be used by a Spyder either to translate a SPARQL query into the appropriate SQL query dynamically or to produce a bulk dump of the data as RDF.
At a high level, the two most basic things the mapping ontology lets a modeler describe are:
(1) How instances of classes in an ontology are created from values in database tables (including how URIs are generated for those instances)
(2) How properties are added to those instances based on values in database tables
The simplest case is mapping the primary key of a table to instances of a class in the ontology and then mapping each column to a property. This would essentially just expose the data in its natural structure, as RDF instead of relational tables.
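The simple case described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Spyder's actual translation: the base namespace, the Employee table, and the URI patterns are hypothetical, SQLite stands in for the source database, and the mapping is hard-coded rather than read from a mapping ontology.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape(value) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"')


def table_to_triples(conn, table, key_column):
    """Yield N-Triples lines: one typed instance per row, one property per column."""
    # Table and column names are interpolated directly; acceptable for a
    # sketch, but not safe with untrusted input.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        # The primary key value generates the instance URI.
        subject = f"<{BASE}{table}/{record[key_column]}>"
        yield f"{subject} <{RDF_TYPE}> <{BASE}{table}> ."
        for column, value in record.items():
            if column == key_column or value is None:
                continue  # NULLs simply produce no triple
            yield f'{subject} <{BASE}{table}#{column}> "{escape(value)}" .'


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Ada", "R&D"), (2, "Grace", None)],
)
triples = list(table_to_triples(conn, "Employee", "id"))
for line in triples:
    print(line)
```

All literals are emitted as plain strings here; a fuller mapping would also carry datatypes and map foreign keys to links between instances.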
The way mappings are described enables a high degree of analysis concerning how the database schema relates to the ontology it is being mapped to. Users can query to see how instances of a class in the ontology are created from a database or how properties are added to those instances. They can also perform two-way "gap analysis". From the direction of the ontology, they can query to see which classes and properties are mapped to a data source and which are not. In a federated environment, a user could see for a particular domain ontology concept which databases are mapped to the concept. From the direction of the data source, a user can query to see which tables, columns, and keys in the database are mapped to the ontology and which are not.
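The two-way gap analysis amounts to a pair of set differences. The sketch below assumes the mappings have already been extracted into a plain Python dictionary; in the architecture described here they would instead be obtained by querying the mapping ontology. All term and column names are hypothetical.

```python
# Hypothetical domain ontology terms (classes and properties).
ontology_terms = {"Employee", "Department", "name", "worksIn"}

# Hypothetical columns discovered in a data source.
source_columns = {"Employee.id", "Employee.name", "Employee.dept", "Employee.badge"}

# Hypothetical mappings: source column -> ontology term.
mappings = {
    "Employee.id": "Employee",
    "Employee.name": "name",
    "Employee.dept": "worksIn",
}


def gap_analysis(ontology_terms, source_columns, mappings):
    """Return (ontology terms with no mapping, source columns with no mapping)."""
    unmapped_terms = ontology_terms - set(mappings.values())
    unmapped_columns = source_columns - set(mappings.keys())
    return unmapped_terms, unmapped_columns


unmapped_terms, unmapped_columns = gap_analysis(ontology_terms, source_columns, mappings)
print("Ontology terms not populated by this source:", sorted(unmapped_terms))
print("Source columns not exposed through the ontology:", sorted(unmapped_columns))
```

Running the same analysis across the mappings of several sources would show, for each domain ontology concept, which databases populate it.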
One approach to analyzing data from more than one source that has exposed RDF is to load the data from all sources into an RDF store, but this has some of the same problems as creating a traditional data warehouse. A semantic architecture integrates data from any number of sources using the inherent capability of RDF and OWL to combine and extend graphs dynamically. The federator accesses data from any number of RDF enabled sources and combines the individual source graphs into a single graph over which queries can be executed. To perform this task, the federator needs to access one or more domain ontologies. As long as a domain ontology is available, the federator will know how to combine nodes in disparate RDF graphs so that queries can be executed over the entire data set. The federator is model driven by the domain ontology. A critical benefit to this approach is that the domain ontology can be continually extended and all of the infrastructure components and queries continue to operate.
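To illustrate why shared domain ontology terms are enough to combine sources, the sketch below represents each source graph as a Python set of (subject, predicate, object) tuples. Merging is then a set union, and a query can join facts that originated in different sources. The URIs, the two sample sources, and the function names are assumptions for illustration; a real federator would plan and distribute the query rather than merge everything up front.

```python
EX = "http://example.org/"  # hypothetical namespace shared via the domain ontology

# Graph exposed by a hypothetical HR source.
hr_graph = {
    (EX + "emp/1", EX + "name", "Ada"),
    (EX + "emp/2", EX + "name", "Grace"),
}

# Graph exposed by a hypothetical project-tracking source.
project_graph = {
    (EX + "emp/1", EX + "assignedTo", EX + "project/alpha"),
}


def federate(*graphs):
    """Merge RDF graphs: the merged graph is the union of their triples."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def names_assigned_to(graph, project):
    """Join two triple patterns: who is assigned to the project, and their names."""
    members = {s for (s, p, o) in graph if p == EX + "assignedTo" and o == project}
    return sorted(o for (s, p, o) in graph if p == EX + "name" and s in members)


merged = federate(hr_graph, project_graph)
print(names_assigned_to(merged, EX + "project/alpha"))
```

The join works only because both sources identify the employee with the same URI, which is the role the domain ontology and the mappings play in the architecture.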
The federator implements sophisticated query costing and query planning algorithms so that queries from users are executed in the most efficient manner. The most important element of efficient query execution is to deal with as little data as is possible and still execute the query. This requires that many operations are delegated down to the native data stores and performed before the conversion to RDF. As mentioned earlier, a Spyder exposes in detail the capabilities of the native source to the federator which then uses the source capability information in query planning and execution, delegating as much processing as possible down to each native source.
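The benefit of delegating work to the native source can be seen in a small comparison. In this sketch, SQLite stands in for a native data store and the row count stands in for the data that would have to be converted to RDF; the table, its contents, and the function names are hypothetical, and no actual query costing is modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (region, total) VALUES (?, ?)",
    [("east", 10.0), ("west", 25.0), ("east", 40.0), ("west", 5.0)],
)


def fetch_then_filter(conn, region):
    """Naive plan: pull every row out of the source, then filter afterwards."""
    rows = conn.execute("SELECT id, region, total FROM orders ORDER BY id").fetchall()
    return len(rows), [row for row in rows if row[1] == region]


def push_down_filter(conn, region):
    """Delegated plan: the source applies the filter before any rows leave it."""
    rows = conn.execute(
        "SELECT id, region, total FROM orders WHERE region = ? ORDER BY id",
        (region,),
    ).fetchall()
    return len(rows), rows


moved_naive, result_naive = fetch_then_filter(conn, "east")
moved_pushed, result_pushed = push_down_filter(conn, "east")
print("rows moved without delegation:", moved_naive)
print("rows moved with delegation:", moved_pushed)
print("same answer:", result_naive == result_pushed)
```

Both plans return the same answer, but the delegated plan moves only the matching rows, which is the effect the federator aims for when it uses the capability information exposed by a Spyder.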
The federator uses a rules engine to enable the processing of certain types of queries. These rules fall into two broad classes: business rules and inferencing. The rules engine is itself model driven, using models of rules defined with RIF semantics.
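As an illustration of the inferencing class of rules, the sketch below applies two standard RDFS entailment rules (transitivity of rdfs:subClassOf, and propagation of rdf:type up the class hierarchy) by forward chaining until no new triples appear. The rules are hard-coded in Python purely for illustration, which is an assumption of this sketch; in the architecture described here they would be supplied as RIF models. The ex: terms are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer(triples):
    """Forward-chain two RDFS rules until a fixed point is reached."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in inferred:
                # Transitivity: s subClassOf o, o subClassOf o2 => s subClassOf o2
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # Type propagation: s2 type s, s subClassOf o => s2 type o
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= inferred:
            return inferred  # nothing new was derived
        inferred |= new


facts = {
    ("ex:Manager", SUBCLASS, "ex:Employee"),
    ("ex:Employee", SUBCLASS, "ex:Person"),
    ("ex:ada", RDF_TYPE, "ex:Manager"),
}
closure = infer(facts)
for triple in sorted(closure - facts):
    print(triple)
```

With these inferred triples available, a query for all instances of ex:Person would return ex:ada even though no source states that fact directly.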
In some cases, queries will operate on data sets that are too large to be manipulated in memory, in which case the graphs will be materialized in an RDF triple store to enable efficient query execution. The triple store is closely integrated with the federator and acts as a persisted cache.
The Business Domain Ontology
The primary objective of a federated information environment is to be able to submit a single query and get results from multiple data sources. In order to do this, there must be an ontology that defines the terms used in queries for the domain. We call this ontology the Business Domain Ontology. In the ETL world, this ontology is analogous to the logical data model of a data warehouse. It is important to note that, unlike many conceptual models, the Domain Ontology is executable in its native form (RDF/OWL). By executable here, we mean that queries can be written against it that will produce results from actual, physical data. It is more similar to a relational data model or an XML schema than it is to a UML model or an ER diagram.
The Business Domain Ontology is not designed in a top-down way. This means that we make no effort to comprehensively model a particular domain. Instead, the ontology is designed and built as the project progresses to account for new requirements. Requirements for some projects come in the form of analytic requirements that need to be supported and data sources that need to be exposed. These two sources of requirements can be looked at as the forces of supply and demand in the information environment - analyses define the demand, data sources define the supply. Like nearly all other systems where there is an interplay between supply and demand, in our environment, demand is more important and drives supply. If you are asking questions and you can't get answers, there is a problem. Data that could be supplied but isn't being supplied is only a problem if it is in demand. Requirements may also come from business processes or applications that wish to use the domain ontology as their information model. We will use existing models of the domain to create an initial "base-line" Domain Ontology. Doing so lessens the burden when the first requirements are modeled.
In order for the Domain Ontology to support any query that a user would like to execute, the concepts needed by the query must be present in the Domain Ontology. When a new reporting requirement is discovered, an ontologist should try to write a query using terms from the Domain Ontology. If this is impossible given the terms that are in the Domain Ontology, the ontologist should extend the Domain Ontology to include the necessary concepts. Following this process ensures that the Domain Ontology can express every known reporting requirement.
When we say that a data source can be used as a requirement for the Domain Ontology, we mean that the necessary concepts must exist in the Domain Ontology to expose all the data we want to expose from a data source. Our general approach is to first do an initial set of "base-line" mappings that expose the data in as natural a way as possible. For the most part, this means that we will map each table to a class and then map each column to a property. When deciding what classes and properties to map to in the Domain Ontology, the first priority is to map to existing concepts. If no existing concept is semantically equivalent to a table or column, then an ontologist will extend the Domain Ontology to add one.
Semantic software architecture is completely distributed as it is all web based. The Spyders can be executed behind firewalls local to data stores or in hosting centers with a JDBC connection over a wide area network to a data source. As many instances of Spyders as necessary can be implemented for each project, and they can be shared across projects. The federator can run behind a firewall or in a hosting facility, and as many instances of the federator as necessary can be started, so that federators can be nested.
A semantic repository is required in the architecture to hold and manage all of the ontologies that drive the solutions. It consists of an RDF triple store that can store and manage RDF graphs, as well as the functionality to understand OWL semantics. It must enable querying of the ontologies either as a collection or individually. Both the federator and the Spyder will access the repository at run-time to obtain the collection of ontologies necessary to accomplish the goals of the application. There is a mechanism to download the ontologies from the repository to the Spyder and federator.
Since the mission for many projects employing semantic architecture is enterprise analytics, the architecture must include a web-based viewing or dashboard capability to meet the requirements. Many web-based viewers already exist, but they were built to issue SQL queries and display views of the result sets. It turns out that commercial viewers that were built for SQL can also work with SPARQL. These tools also have the capability to design views with varying levels of support for advanced analytics.