Segment 2: An Introduction to Web Querying
This segment is a two-part introduction to Web querying: in the first part we look at the aims and promises of Web querying; in the second part we take a quick tour of the core features of Xcerpt, the Web query language developed in REWERSE.
Defining “Web Querying”
First, we should clarify what we mean by “Web querying”. Let's start with the current state of information access on the Web. When you look for information on the Web nowadays, there are two main ways to go about it. Either you already know fairly well where the information you are interested in resides, in which case you (or the application you are using) provide a URI to retrieve the information directly. E.g., to get information about query languages you might go to
wikipedia.org/wiki/query_language.
If you don't know where the information might reside, you turn to a search engine such as Google or Yahoo to find possible sites from which to retrieve that data, e.g., by entering “researcher Web querying”.
As usual, things aren't that black and white. For example, you might know that the information you are looking for is at a certain site and use that site's local search instead of a search engine.
In either case, however, what you retrieve is a unit of information predefined by the information publisher, usually in the form of a Web site. Try answering a question such as “which institutions employ preeminent researchers on Web querying?” this way. It's not going to be easy unless someone already provides a pre-compiled answer to this question. Currently, you have to go through all the Web sites returned by Google by hand to decide who qualifies and where they work.
Web query languages come into play at this point: rather than finding sites from which to retrieve the information you are interested in (as search engines do), Web query languages allow the extraction of specific data from already known sites (or from sites returned by a search engine request).
Let's assume we want to build a service aggregating price information from different shopping sites such as Amazon. What we can do easily (and without using a query language) is access information about a certain product. Amazon provides a nice query interface for that: you give a URI with some parameters including, e.g., keywords for the items sought, or identifiers if you already know what to look for.
http://ecs.amazonaws.com/onca/xml?
Service=AWSECommerceService&
AWSAccessKeyId=XXXXXXXXXXXXXX&
Operation=ItemLookup&
ResponseGroup=Medium&
ItemId=B00008OE6I
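The request above is nothing more than a list of parameter-value pairs appended to a base address. A minimal Python sketch of how such a URI could be assembled is given here; the function name build_item_lookup_url is made up for illustration, the access key is the placeholder from the request above, and the sketch only builds the string without sending anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address and parameter names are copied from the request shown above.
ENDPOINT = "http://ecs.amazonaws.com/onca/xml"

def build_item_lookup_url(item_id, access_key="XXXXXXXXXXXXXX",
                          response_group="Medium"):
    """Hypothetical helper: assemble the lookup URI from parameter-value pairs."""
    params = {
        "Service": "AWSECommerceService",
        "AWSAccessKeyId": access_key,      # placeholder, not a working key
        "Operation": "ItemLookup",
        "ResponseGroup": response_group,
        "ItemId": item_id,
    }
    # urlencode joins the pairs with "&" and escapes special characters.
    return ENDPOINT + "?" + urlencode(params)

url = build_item_lookup_url("B00008OE6I")
print(url)
```

Everything you can ask for has to fit into one of these fixed parameters, which is exactly the limitation discussed next.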
However, this interface, like most such interfaces, is limited to simple parameter-value pairs: you can set a number of fixed parameters to limit the extent of the returned data, but beyond those fixed parameters you are on your own. Also, the format of the answer is pretty much the same every time. We can't really look at it in a browser, since it's not HTML but Amazon's own XML format, which my browser just doesn't know about. Let's look at that data again in a more accessible version:
What you see here is that a lot of information about the selected product is returned. Extracting just the price (and possibly processing it further by converting currency, adding tax, etc.) requires further analysis of the data, drilling down to only the information we are interested in.
From Search to Query
This sort of task is what Web query languages excel at. They provide a declarative means of specifying what data to extract from such a product description. In contrast, conventional programming languages require you to use APIs such as DOM or SAX. Though these are powerful tools that are also suited for manipulating the data, they require very detailed knowledge of the queried data, as they force the programmer to spell out one explicit navigation path through the data.
Web query languages figure out automatically how to navigate to the requested data, given a rather high-level specification from the programmer. They also excel at specifying queries when you don't know the exact extent of the retrieved information or when it changes over time. E.g., to retrieve the price it might suffice to simply say: find elements with a certain label (here ListPrice) anywhere in the data. Changing the position of the price element does not affect the specification of the query (the execution may change, but since that is left to the query language, the programmer need not care).
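To make the contrast concrete, here is a rough Python sketch using the standard xml.etree.ElementTree module. It is not how Xcerpt works internally; it only illustrates the difference between spelling out a navigation path and asking for a label anywhere in the data. Both sample documents are invented and heavily simplified (no namespaces, made-up values), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, simplified stand-ins for a product description.
ORIGINAL = """<ItemLookupResponse><Items><Item>
  <ItemAttributes>
    <ListPrice><Amount>1999</Amount></ListPrice>
    <Title>Example product</Title>
  </ItemAttributes>
</Item></Items></ItemLookupResponse>"""

# Same data, but the price element has moved to a different place.
MOVED = """<ItemLookupResponse><Items><Item>
  <ItemAttributes><Title>Example product</Title></ItemAttributes>
  <OfferSummary><ListPrice><Amount>1999</Amount></ListPrice></OfferSummary>
</Item></Items></ItemLookupResponse>"""

def price_by_navigation(root):
    """Navigational style: every step of the path is fixed by the programmer."""
    items = root.find("Items")
    item = items.find("Item")
    attributes = item.find("ItemAttributes")
    return attributes.find("ListPrice")      # None if the price is elsewhere

def price_by_label(root):
    """Declarative style: 'a ListPrice element anywhere in the data'."""
    return next(root.iter("ListPrice"), None)

for name, text in (("original", ORIGINAL), ("moved", MOVED)):
    root = ET.fromstring(text)
    print(name, price_by_navigation(root) is not None,
          price_by_label(root) is not None)
```

On the second document the fixed path no longer finds the price, while the label-based lookup is unaffected by the move.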
Currently, there are a number of Web query languages but all of them are limited to specific Web formats, e.g., WebSQL for HTML, XQuery & XSLT for XML, SPARQL for RDF. Just Google for any of these names to find further information about them. You can also listen to the state-of-the-art segment in this series for further information about these languages.
In REWERSE, we have looked at the premise that access to data in multiple formats is a necessity for information integration and processing on the Web and, hence, should be supported transparently by a Web query language. For a recent book, we have put together a collection of resources on Web querying and versatility which you can visit at
http://www.squidoo.com/versatile-web-query-languages/.
The collection consists of a brief discussion of the idea of versatility for Web query languages, followed by links to use cases, surveys, and concrete exemplars of Web query languages. Do take a bit of time to look at some of those links!
Querying with Xcerpt
Returning to our Amazon example, let's take a shot at extracting the price with Xcerpt, the first versatile Web query language with access to data in multiple formats.
Xcerpt has been developed by our working group in the REWERSE project under the lead of Francois' group at the University of Munich. Aside from its versatility, which will be demonstrated in later segments (to understand how it is used, you need to get the basics first), Xcerpt is also, in our belief, a very convenient query language. The main principles of Xcerpt are, first, the use of data as queries in the spirit of what is called query by example and, second, the use of rules for reasoning with and chaining of Web queries. Let's see how that looks in practice: I have already prepared a query to do just that. However, it's also very easy to do that from scratch, and we will show you how in one of the following segments.
If we want to write a query, we can simply turn some data fragment into a pattern (or example) of the data we want to retrieve. However, we can "under-specify" the pattern, identifying the data only incompletely. Here we choose to say that the data should have an ItemLookupResponse root and contain, nested anywhere inside (indicated by the descendant: "we go as deep in the hierarchy as we want to"), another element labeled ListPrice. ListPrice in turn contains, directly this time (no descendant), an Amount element with the actual data inside. The data is selected in a variable called Title, so that we can reference it later (e.g., in the result).
The first thing to note is that the labels in the query reflect the labels of the data, but there is much less noise in the query. We specify just those parts that we are interested in and leave the rest open. This is achieved by two kinds of incompleteness: in depth, indicated by the descendant, and in breadth, indicated by the dotted instead of solid borders of the elements. It means that we look for ListPrice elements anywhere in an ItemLookupResponse and that there may be other things besides ListPrice elements in an ItemLookupResponse. There may also be other things besides an Amount element in ListPrice; however, the Amount must occur directly inside. The actual price value is the only thing contained in the Amount.
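For readers who prefer to see the matching conditions written out, the following Python sketch emulates what the pattern asks for, using the standard xml.etree.ElementTree module. It is an approximation for illustration only, not Xcerpt syntax and not the Xcerpt engine: the function name match_price_pattern is hypothetical, and the sample document is invented and stripped of namespaces.

```python
import xml.etree.ElementTree as ET

def match_price_pattern(root):
    """Return one variable binding per match of the pattern described above."""
    if root.tag != "ItemLookupResponse":            # the root label is fixed
        return []
    bindings = []
    for list_price in root.iter("ListPrice"):       # at any depth (descendant)
        for amount in list_price.findall("Amount"):  # direct children only
            # Amount must hold nothing but the value itself.
            if len(amount) == 0 and amount.text:
                bindings.append({"Title": amount.text.strip()})
    return bindings

# Invented sample: siblings next to ListPrice and Amount are simply ignored
# (incompleteness in breadth).
doc = ET.fromstring(
    "<ItemLookupResponse><Items><Item>"
    "<ASIN>B00008OE6I</ASIN>"
    "<ListPrice><Amount>1999</Amount><CurrencyCode>USD</CurrencyCode></ListPrice>"
    "</Item></Items></ItemLookupResponse>"
)
print(match_price_pattern(doc))  # [{'Title': '1999'}]
```

An Amount that sits deeper inside ListPrice, say wrapped in another element, would not be matched, because the pattern requires it to occur directly inside.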
If we execute this query, only the sought-for data is returned. We could also process this data further, e.g., do some arithmetic with it, like adding state tax or comparing it with the price of the same item in another shop.
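As a small illustration of such post-processing, here is a Python sketch. All numbers are invented, the tax rate and shop names are placeholders, and the sketch assumes that the retrieved amount is given in the smallest currency unit (cents), which should be checked against the actual data.

```python
from decimal import Decimal, ROUND_HALF_UP

def with_tax(amount_cents, tax_rate):
    """Convert an amount in cents to currency units and add a tax rate."""
    price = Decimal(amount_cents) / 100
    total = price * (1 + Decimal(tax_rate))
    # Round to two decimal places, as is usual for prices.
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Hypothetical amounts for the same item retrieved from two shops.
offers = {
    "shop_a": with_tax("1999", "0.08"),
    "shop_b": with_tax("1850", "0.08"),
}
cheapest = min(offers, key=offers.get)
print(offers["shop_a"], offers["shop_b"], cheapest)  # 21.59 19.98 shop_b
```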
In the following segments we will take a closer look at authoring such queries and at the precise meaning of several of Xcerpt's query constructs.