RDF Normalization and Parsing in Xcerpt
Accessing RDF in a general semi-structured query language such as Xcerpt requires parsing and normalizing RDF serializations. Unfortunately, there are many different serializations for RDF (cf. Oliver Bolzer's master's thesis for an overview). Furthermore, RDF/XML, the serialization recommended by the W3C, is often criticized for its unnecessary complexity and representational variety. To access data in RDF/XML, we present a set of rules that normalize that data into an easy-to-understand internal triple representation (similar to RXR or N-Triples).
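To make the idea of normalization concrete, here is a minimal sketch in Python rather than in Xcerpt. It is not the rule set itself and covers only a small subset of RDF/XML: node elements with `rdf:about`, typed node elements, and property elements with either an `rdf:resource` attribute or plain literal text. The function name `normalize`, the `example.org` IRIs, and the tuple-based triple format are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def _expand(tag):
    """Turn ElementTree's '{namespace}local' tag into a plain IRI string."""
    return tag[1:].replace("}", "", 1) if tag.startswith("{") else tag


def normalize(rdf_xml):
    """Return a set of (subject, predicate, object) triples.

    Hypothetical helper: handles only rdf:about subjects, typed node
    elements, rdf:resource objects, and plain literals (stored in quotes).
    """
    triples = set()
    root = ET.fromstring(rdf_xml)
    for node in root:
        subject = node.get(f"{{{RDF_NS}}}about")
        if subject is None:
            continue  # blank nodes, rdf:ID, etc. are out of scope here
        node_type = _expand(node.tag)
        if node_type != RDF_NS + "Description":
            # A typed node element is shorthand for an rdf:type triple.
            triples.add((subject, RDF_NS + "type", node_type))
        for prop in node:
            predicate = _expand(prop.tag)
            resource = prop.get(f"{{{RDF_NS}}}resource")
            if resource is not None:
                triples.add((subject, predicate, resource))
            else:
                literal = (prop.text or "").strip()
                triples.add((subject, predicate, f'"{literal}"'))
    return triples


# Two RDF/XML documents that differ syntactically but describe the same graph.
VARIANT_A = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/">
  <ex:Person rdf:about="http://example.org/alice">
    <ex:name>Alice</ex:name>
  </ex:Person>
</rdf:RDF>"""

VARIANT_B = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/">
  <rdf:Description rdf:about="http://example.org/alice">
    <rdf:type rdf:resource="http://example.org/Person"/>
    <ex:name>Alice</ex:name>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    for triple in sorted(normalize(VARIANT_A)):
        print(triple)
```

Both variants normalize to the same two triples, which is the kind of representational variety the normalization rules have to absorb. Full RDF/XML has many more abbreviation forms (nested node elements, property attributes, containers, `rdf:parseType`), none of which this sketch attempts.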
This rule set for RDF/XML normalization is part of our work, started in 2004/2005, on a better integration of RDF into Xcerpt. Xcerpt has been conceived from the very beginning as a versatile query language capable of accessing any form of semi-structured data. In practice, however, each format poses its own specific challenges.
In the case of RDF, our first step toward supporting RDF in Xcerpt has been twofold:
- Provide Xcerpt rule libraries that can parse and "normalize" RDF data into canonical Xcerpt representations;
- Provide Xcerpt rule libraries that handle specificities of the RDF semantics, i.e., that (transparently) extend the RDF graph in accordance with the RDF(S) entailment rules from the RDF model theory, see [rdfs-reasoning-in-xcerpt].
For details on how these rules are constructed and what shortcomings they have, please refer to Oliver Bolzer's master's thesis on RDF access in Xcerpt.
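The second kind of library, the one that extends the graph by RDF(S) entailment, can be illustrated with a small Python sketch. It applies only two of the RDFS entailment rules, the transitivity of `rdfs:subClassOf` (rdfs11) and the propagation of `rdf:type` along `rdfs:subClassOf` (rdfs9), until no new triples appear. The function name `rdfs_closure`, the abbreviated names such as `ex:Student`, and the tuple format are assumptions for illustration; this is not how the Xcerpt libraries are written.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Extend a set of triples with rules rdfs9 and rdfs11 until a fixed point.

    Hypothetical helper covering two entailment rules only; the full
    RDF(S) rule set is considerably larger.
    """
    closure = set(triples)
    while True:
        new = set()
        for (s, p, o) in closure:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in closure:
                # rdfs11: s subClassOf o, o subClassOf o2 => s subClassOf o2
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # rdfs9: s subClassOf o, s2 type s => s2 type o
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if new <= closure:
            return closure  # nothing new was derived
        closure |= new


GRAPH = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:alice", RDF_TYPE, "ex:Student"),
}

if __name__ == "__main__":
    for triple in sorted(rdfs_closure(GRAPH) - GRAPH):
        print("derived:", triple)
```

The naive nested loop is quadratic in the number of triples per round, which is fine for a sketch but hints at why large rule sets over RDF data are demanding to evaluate.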
All files pertaining to this application of Xcerpt can be found at http://svn.amachos.com/xcerpt/applications/2004/rdf-normalization/. Feel free to access them using any browser or to check them out with any Subversion client.
It might be helpful to keep a few points in mind while looking at these rules:
- First, due to the large number of rules, in particular of rules contributing data to RDF-TRIPLE, evaluating this rule set with the 2004 Xcerpt prototype is not advisable. We hope that the new, abstract-machine-based implementation of Xcerpt will make programs of this kind more feasible in practice.
- Second, these rules use 2004 Xcerpt syntax, which differs quite noticeably from the current Xcerpt syntax described in Deliverable D6.
- Third, the representation of RDF in Xcerpt is, admittedly, suboptimal. The goal for these rules was to get them working in the 2004 Xcerpt prototype, and this required certain compromises. We are currently working on a better representation and processing of RDF in Xcerpt; please contact Benedikt Linse for more information.
It is likely that there are some inconsistencies in these rules. In the process of revising the RDF representation in Xcerpt, we will also take another look at these rules and fix any remaining errors. If you find any problems or have questions, please feel free to contact any member of our team or post to the xcerpt-devel mailing list.