Reasoning-aware, format versatile Web querying that makes querying as easy as creating data

Practical Querying with (vis)Xcerpt: The Basics

This segment is the first of several related Webcasts on querying, extracting, and processing Web data with the visual rendering of Xcerpt, called visXcerpt. We will compose a fairly simple query on real-life Web data from scratch, learning about concepts such as incompleteness in breadth and depth along the way. You will see that querying in visXcerpt and Xcerpt really isn't much more complicated than writing Web data, thanks to Xcerpt's pattern-based (or example-based) approach to querying.

Amazon Product Data (Again)

We are going to continue with the use case from previous segments: product data from Amazon. However, in this segment we will focus on how to author a query from scratch, given only the input data (as returned from Amazon) and the query intent.

Recall that the data is retrieved from the Amazon web service by asking about characteristics of a specific item, in this case a digital camera, an item we might have found by a previous search request, e.g., with appropriate keywords. The query we would like to author retrieves, from all this information about the camera, only its price (if given in USD).

Let's take a look at the Amazon data again. It is data about one item in the Amazon database, requested specifically.

First step: Changing the URI

So let's go to the visXcerpt interface and start with a really bare-bones query. It consists of only a single goal. The result of a goal becomes the result of the entire query program. The goal queries some default URI for a top-level element called SampleElement. If such an element is found, it constructs a result element as output of the query.

Let's start adapting this query to access our Amazon data. First we have to change the URI to match the URI of the data we want to query. In this case, I have mirrored the Amazon response locally, so we can simply replace simple.xml with amazon-data.xml.

query-base
What would happen if we execute the query right now? We get an error as answer with the explanation that there were no results to this query.

Second step: Top-level element

Why is that? Let's look at the data to figure out why this query failed.

I have prepared the data in a separate frame. The first thing to note is that the query asks for SampleElement elements, whereas the data provides an ItemLookupResponse element at the top level.
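The mismatch can also be checked outside of visXcerpt. The following is a minimal Python sketch, not Xcerpt itself: the XML string is a hypothetical, heavily trimmed stand-in for the mirrored Amazon response (the real data has many more elements and uses an XML namespace, which is left out here), and top_level_is is a helper name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for amazon-data.xml.
# Element names follow the Amazon response discussed in the text;
# the values are illustrative assumptions, not real data.
SAMPLE = """<ItemLookupResponse>
  <Items>
    <Item>
      <ItemAttributes>
        <ListPrice>
          <Amount>49999</Amount>
          <CurrencyCode>USD</CurrencyCode>
          <FormattedPrice>$499.99</FormattedPrice>
        </ListPrice>
      </ItemAttributes>
    </Item>
  </Items>
</ItemLookupResponse>"""

root = ET.fromstring(SAMPLE)


def top_level_is(root, label):
    """Return True if the document's top-level element carries the given label."""
    return root.tag == label


print(top_level_is(root, "SampleElement"))       # the label the initial query asks for
print(top_level_is(root, "ItemLookupResponse"))  # the label the data actually provides
```

The first check fails and the second succeeds, which is exactly the label mismatch visible when comparing the query with the data.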

So let's change that and reexecute the query ...
error-msg
Hmm, we still get the same error? What did we do wrong?

Third step: Partial Data Terms

Let's compare the query with the data again. Notice that in the data, ItemLookupResponse contains several elements nested inside (affectionately called children). In the query, however, it does not. Should we add these children? But then we would also have to add their children, and so on and so forth. That's not really what we want when querying. The only thing we care about is that there is an ItemLookupResponse, not what its children are.
query-data
For exactly that purpose, Xcerpt provides the concept of partial (or breadth-incomplete) queries. A partial query explicitly permits additional sub-elements to occur inside a marked element. In visXcerpt we mark an element as partial by making its border dotted instead of solid:
partial
So let's execute the query again. Aha, something changed. Now we get a result element back.
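The difference between a total and a partial pattern can be emulated in a few lines of Python. This is only a sketch of the matching idea, under stated assumptions: patterns are written as hypothetical (label, child_patterns, partial) tuples, text content is ignored, the order of children is ignored, and the sample document is a trimmed stand-in for the Amazon data. It is not how Xcerpt is implemented.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative stand-in for the Amazon response (an assumption).
root = ET.fromstring(
    "<ItemLookupResponse><OperationRequest/><Items/></ItemLookupResponse>"
)


def matches(pattern, element):
    """Match a (label, child_patterns, partial) tuple against an element.

    Total pattern: the element must have exactly the children listed.
    Partial pattern: the listed children must be present, extras are allowed.
    """
    label, child_patterns, partial = pattern
    if element.tag != label:
        return False
    children = list(element)
    if not partial and len(children) != len(child_patterns):
        return False  # a total pattern tolerates no additional children
    return _assign(child_patterns, children)


def _assign(patterns, children):
    """Map every pattern to a distinct child, using simple backtracking."""
    if not patterns:
        return True
    first, rest = patterns[0], patterns[1:]
    for i, child in enumerate(children):
        remaining = children[:i] + children[i + 1:]
        if matches(first, child) and _assign(rest, remaining):
            return True
    return False


total_query = ("ItemLookupResponse", [], False)   # solid border in visXcerpt
partial_query = ("ItemLookupResponse", [], True)  # dotted border in visXcerpt

print(matches(total_query, root))    # False: the data has children the query does not list
print(matches(partial_query, root))  # True: additional children are permitted
```

The only difference between the two queries is the partial flag, which mirrors switching the border from solid to dotted.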

But that's still a fairly boring answer. The only thing we get from that result is knowing that the data at the URI specified in the query indeed has ItemLookupResponse as top-level element. However, what we are interested in is the price of the item returned in the ItemLookupResponse!

Fourth step: Adding Structure

For that, we start by adding the ListPrice element to the query. It should occur inside the ItemLookupResponse, so let's paste it there.
structure01
However, now the query is overspecified. Selecting the price should not depend on the formatted price. Remember, whenever we omit a subelement from the data in the query, we have to make the parent term partial.

Also, we don't want to specify the price in the query (thus selecting only items with price 499,99$); we want to extract the price. How do we extract the price? With a variable ...
variable
Looks good to you? Let's try it ... fails again ... why?

Fifth step: Descendant and Variables

In the data, there are intermediate elements between ItemLookupResponse and ListPrice, and we need to tell Xcerpt to ignore them. We do this by adding a descendant construct.
descendant
Let's try again. It works, but where is the price ... For that we need to add the variable to the result. That's it. Now we are selecting exactly what we want!
result
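To make the last two steps concrete, here is a small Python emulation of a pattern with a variable and a descendant construct. It is a sketch under several assumptions, not Xcerpt's actual implementation: the classes Elem, Desc, and Var and the function match are hypothetical names, every element pattern is treated as partial, sibling patterns are not forced onto distinct children, and the XML is a trimmed stand-in for the Amazon data with illustrative values.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Var:
    name: str            # binds to the text content of the enclosing element


@dataclass
class Elem:
    label: str
    children: list = field(default_factory=list)  # Elem, Desc, Var, or literal str


@dataclass
class Desc:
    pattern: Elem        # may match at any depth below the current position


def match(pattern, element):
    """Return a list of variable bindings, one dict per successful match."""
    if isinstance(pattern, Desc):
        # element.iter() yields the element itself and everything nested in it.
        return [b for cand in element.iter() for b in match(pattern.pattern, cand)]
    if element.tag != pattern.label:
        return []
    solutions = [{}]
    text = (element.text or "").strip()
    for child in pattern.children:
        if isinstance(child, Var):
            options = [{child.name: text}]
        elif isinstance(child, str):
            options = [{}] if text == child else []
        else:
            options = [b for sub in element for b in match(child, sub)]
        solutions = [{**s, **o} for s in solutions for o in options]
    return solutions


# Trimmed, illustrative stand-in for the Amazon response (an assumption).
root = ET.fromstring(
    "<ItemLookupResponse><Items><Item><ItemAttributes><ListPrice>"
    "<Amount>49999</Amount><CurrencyCode>USD</CurrencyCode>"
    "<FormattedPrice>$499.99</FormattedPrice>"
    "</ListPrice></ItemAttributes></Item></Items></ItemLookupResponse>"
)

price = Elem("ListPrice", [Elem("Amount", [Var("Price")]),
                           Elem("CurrencyCode", ["USD"])])

without_desc = Elem("ItemLookupResponse", [price])
with_desc = Elem("ItemLookupResponse", [Desc(price)])

print(match(without_desc, root))  # []: ListPrice is not a direct child
bindings = match(with_desc, root)
print([f"<result>{b['Price']}</result>" for b in bindings])
```

Without the descendant construct the pattern finds nothing, because ListPrice is several levels below ItemLookupResponse. With it, the variable Price is bound and can be placed into the result, which corresponds to the final query above.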

Summary

To summarize, we have seen how data is turned into a query and how we can exploit incompleteness constructs (partiality and descendant) to make a query focus only on what is given by the query intent.

In further segments we will look at more expressive patterns (with negation and optional parts), at aggregation and grouping (which allow us to return multiple matches of the same query), and at rules in Xcerpt (which allow us to separate a query into small, focused, and easy-to-understand segments).