Reasoning-aware, format versatile Web querying that makes querying as easy as creating data

GData and Atom Processing in Xcerpt

Touted as the successor to the widely deployed but technically rather unsatisfying RSS, Atom is a recent IETF standard for Web feeds. Together with the Atom publishing protocol, it allows not only access to, but also creation, change, and deletion of entries in Web sites such as blogs or other collections of data. Indeed, Google uses an extension of Atom, dubbed GData, as the API for its calendar service GCalendar. In the following, access to GCalendar from Xcerpt for structured display of upcoming events is discussed.

Introduction

The following is a brief description of a small feed processor used to generate the Web site of the PMS/CIS graduate seminars. The feed processor is implemented using Xcerpt and based on a Google calendar used to manage the information on upcoming and past sessions of the graduate seminar.

The salient features of the Web site and feed processor are:

  • The use of Google calendar feeds allows us (a) to edit the calendar collaboratively, (b) to provide iCal and Atom feeds of the graduate seminar sessions, and (c) to demonstrate the ability to "feed real feeds" to Xcerpt.
  • Sessions of the graduate seminar are grouped into current (within the next two weeks), into past, and into future sessions.
  • Google provides only general event metadata. Information specific to graduate seminar sessions is extracted from the atom:content element of each entry in the feed. Xcerpt's regular expression facilities also come in handy elsewhere, e.g., to process time and location information.
  • Processing the Google feeds takes about 10s on a reasonably up-to-date machine, of which about 2-3s are communication time. We have therefore made page creation asynchronous: every 30 minutes a new version is generated from the current information in the Google calendar.
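The grouping of sessions described above can be sketched in a few lines of Java. This is only an illustration, not the code used by the application: the class and method names are hypothetical, and the exact window boundaries (today inclusive, two weeks later exclusive) are an assumption.

```java
import java.time.LocalDate;

// Sketch only: class, enum, and method names are hypothetical, and the
// half-open two-week window [today, today + 14 days) is an assumption.
class SessionGrouping {
    enum Group { PAST, CURRENT, FUTURE }

    // Classifies a session date relative to the reference date 'today'.
    static Group classify(LocalDate session, LocalDate today) {
        if (session.isBefore(today)) {
            return Group.PAST;
        }
        if (session.isBefore(today.plusWeeks(2))) {
            return Group.CURRENT;
        }
        return Group.FUTURE;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        // Print the group for a few offsets (in days) around today.
        for (int offset : new int[] {-7, 0, 13, 14, 60}) {
            LocalDate session = today.plusDays(offset);
            System.out.println(session + " -> " + classify(session, today));
        }
    }
}
```

Passing the reference date as a parameter, rather than reading the clock inside the method, keeps the classification reproducible when the page is regenerated on a schedule.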

Data

Google Calendar is one among many recently launched Web-based calendar management applications. For our purpose it is set apart by providing a clean and well-documented API to access the information in a calendar.

Google Calendar Feeds

Calendar information can be accessed either in iCal format or as Atom/GData feeds. Atom is an IETF-standardized (RFC 4287) so-called "syndication" format. What that really means is that an Atom document is a list of entries (ranging from articles and calendar events to changes in a Subversion repository) together with associated metadata. Often these entries are related to some temporal event and sorted accordingly, e.g., by the publication date of an article or the commit time of a change. A detailed Atom feed is shown below.

The Google Data API, GData for short, is based on Atom and includes both a query API (accessed using HTTP GET requests) that serves Atom feeds as answers and an update API (accessed using HTTP POST, PUT, and DELETE requests) that implements the Atom publishing protocol. For this application, only the query API is currently used. GData feeds are Atom feeds enhanced by some XML elements in the GData namespace (viz. http://schemas.google.com/g/2005). A detailed GData feed is shown below.
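To make the structure of such a feed concrete, here is a minimal Java sketch that reads a GData-style Atom document with the standard DOM API and lists title, start time, and location per entry. It is not part of the application (which does this with Xcerpt patterns). The class and helper names are hypothetical, the sample feed is made up for illustration, and the use of gd:when/@startTime and gd:where/@valueString reflects our understanding of the GData event elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch only: hypothetical names; the sample data below is invented.
class GDataFeedReader {
    static final String ATOM = "http://www.w3.org/2005/Atom";
    static final String GD = "http://schemas.google.com/g/2005";

    // A tiny made-up feed with a single entry, for illustration only.
    static final String SAMPLE =
        "<feed xmlns='http://www.w3.org/2005/Atom' xmlns:gd='http://schemas.google.com/g/2005'>"
        + "<title>Graduate Seminar</title>"
        + "<entry>"
        + "<title>Example session</title>"
        + "<content type='text'>Speaker: N.N.</content>"
        + "<gd:when startTime='2006-06-08T14:00:00.000+02:00' endTime='2006-06-08T16:00:00.000+02:00'/>"
        + "<gd:where valueString='Room 1'/>"
        + "</entry>"
        + "</feed>";

    // Returns one "title | startTime | location" line per atom:entry.
    static List<String> summarize(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(feedXml.getBytes(StandardCharsets.UTF_8)));
            List<String> lines = new ArrayList<>();
            NodeList entries = doc.getElementsByTagNameNS(ATOM, "entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                lines.add(text(entry, ATOM, "title")
                    + " | " + attr(entry, GD, "when", "startTime")
                    + " | " + attr(entry, GD, "where", "valueString"));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed feed", e);
        }
    }

    // Text of the first matching descendant element, or "" if there is none.
    static String text(Element parent, String ns, String name) {
        NodeList nodes = parent.getElementsByTagNameNS(ns, name);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Attribute of the first matching descendant element, or "" if there is none.
    static String attr(Element parent, String ns, String name, String attribute) {
        NodeList nodes = parent.getElementsByTagNameNS(ns, name);
        return nodes.getLength() == 0 ? "" : ((Element) nodes.item(0)).getAttribute(attribute);
    }

    public static void main(String[] args) {
        summarize(SAMPLE).forEach(System.out::println);
    }
}
```

The namespace-aware lookups mirror the point made above: the Atom elements carry the generic metadata, while the event-specific data sits in elements of the GData namespace.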

Technology

On the technology side, we use two things: Xcerpt, a Web query language developed at the PMS chair, and Java, which is needed as glue due to limitations of the current Xcerpt implementation.

Xcerpt

Xcerpt is a Web query language developed at the PMS chair and as part of the I4 working group of the REWERSE EU network of excellence. It is distinguished from most other Web query languages by the use of rules and patterns instead of functions and navigational path expressions for accessing XML and RDF data. For more details, please take a look at Xcerpt's homepage and the publication list of the PMS chair.

For this application we make little use of Xcerpt's rules. Rather, Xcerpt's query and construct terms prove to be expressive enough to do all required data processing in one or two rules. The main rule accesses a Google calendar feed containing all sessions for a particular time frame (e.g., for the next two weeks or for the past semester) and extracts all the needed information from that feed. The detailed pattern for the current sessions is shown below.

Java

Some Java glue is needed to overcome some of the shortcomings of the current prototype of Xcerpt. Please note that these are not shortcomings of the language Xcerpt, but rather of its current implementation. The main shortcomings are a lack of data processing functions and a bug related to variables in resource URIs.

The Java "glue" program generates three feed URIs for current, past, and future sessions and, for performance reasons, also does some of the post-processing.
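As an illustration of the URI generation step, the following Java sketch derives three query URIs from a base feed address. It is not the application's actual glue code: the class and method names, the example base address, and the six-month horizon for past and future sessions are all hypothetical. It also assumes that the calendar query API accepts the date-range parameters start-min and start-max with timestamps of the form shown.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: hypothetical names; start-min/start-max and the six-month
// horizon are assumptions, not taken from the application's source.
class FeedUris {
    // Builds a query URI restricted to events starting in [from, to).
    static String feedUri(String baseFeed, LocalDate from, LocalDate to) {
        return baseFeed
            + "?start-min=" + from + "T00:00:00"
            + "&start-max=" + to + "T00:00:00";
    }

    // One URI per chunk of the Web site: past, current, and future sessions.
    static Map<String, String> allFeeds(String baseFeed, LocalDate today) {
        LocalDate endOfCurrent = today.plusWeeks(2);
        Map<String, String> feeds = new LinkedHashMap<>();
        feeds.put("past", feedUri(baseFeed, today.minusMonths(6), today));
        feeds.put("current", feedUri(baseFeed, today, endOfCurrent));
        feeds.put("future", feedUri(baseFeed, endOfCurrent, today.plusMonths(6)));
        return feeds;
    }

    public static void main(String[] args) {
        // Placeholder address; substitute the real calendar feed URI.
        String base = "http://www.example.org/calendar/feeds/seminar/public/full";
        allFeeds(base, LocalDate.now())
            .forEach((name, uri) -> System.out.println(name + ": " + uri));
    }
}
```

Computing the concrete URIs outside Xcerpt sidesteps the prototype's problem with variables in resource URIs mentioned above, since each generated Xcerpt program then only sees a constant address.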

The source code for the application can be found at http://svn.amachos.com/xcerpt/applications/2006/gcalendar/. The three Xcerpt programs used to generate the three chunks of the Web site (generate-# with # as current, past, and future sessions) are generated at run-time from template files. As one can easily see, the actual Xcerpt rules, though lengthy due to the HTML construction, are rather simple. No chaining takes place, mostly due to expressivity and speed limitations of the 2004 Xcerpt prototype. If you have comments or questions about this application, feel free to post them here or contact Tim Furche or any other member of the Xcerpt team.