Practical Querying with (vis)Xcerpt: The Basics
This segment is the first of several related Webcasts on querying, extracting, and processing Web data with visXcerpt, the visual rendering of Xcerpt. We will compose a fairly simple query on real-life Web data from scratch, learning about concepts such as incompleteness in breadth and depth along the way. You will see that querying in visXcerpt and Xcerpt really isn't much more complicated than writing Web data, thanks to Xcerpt's pattern-based (or example-based) approach to querying.
Amazon Product Data (Again)
We are going to continue with the use case from previous segments: product data from Amazon. However, in this segment we will focus on how to author a query from scratch, given only the input data (as returned from Amazon) and the query intent.
Recall that the data is retrieved from the Amazon web service by asking about the characteristics of a specific item, in this case a digital camera. We might have found this item through a previous search request, e.g., with appropriate keywords. The query we would like to author retrieves, from all this information about the camera, only its price (if given in USD).
Let's take a look at the Amazon data again: it is data about one specifically requested item in the Amazon database.
First step: Changing the URI
So let's go to the visXcerpt interface and start with a really bare-bones query. It consists of only a single goal. The result of a goal becomes the result of the entire query program. The goal queries some default URI for a top-level element called SampleElement. If such an element is found, it constructs a result element as output of the query.
Let's start adapting this query to access our Amazon data: first we have to change the URI to match the URI of the data we want to query. In this case, I have mirrored the Amazon response locally, so we can simply replace simple.xml with amazon-data.xml.
What would happen if we execute the query right now? We get an error as answer with the explanation that there were no results to this query.
Second step: Top-level element
Why is that? Let's look at the data to figure out why this query failed.
I have prepared the data in a separate frame. The first thing to note is that the query asks for SampleElement elements, whereas the data provides an ItemLookupResponse element at the top level.
So let's change that and reexecute the query ...
Hmm, we still get the same error? What did we do wrong?
Third step: Partial Data Terms
Let's compare the query with the data again. Notice that in the data, ItemLookupResponse contains several elements nested inside it (affectionately called children). In the query, however, it does not. Should we add these children? But then we would also have to add their children, and so on and so forth. That's not really what we want to do when querying. The only thing we care about is that there is an ItemLookupResponse, not what its children are.
Exactly for that purpose, Xcerpt provides the concept of partial (or breadth-incomplete) queries. A partial query explicitly permits additional sub-elements to occur inside a marked element. In visXcerpt we mark an element as partial by making its border dotted instead of solid.
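The difference between a total (solid border) and a partial (dotted border) query term can be emulated outside of Xcerpt. The following is only a rough Python sketch of the idea, not Xcerpt syntax and not how the Xcerpt engine is implemented: the Pattern class, the matches function, and the abbreviated sample data (including the Items and Item element names) are assumptions made for illustration, and text content is ignored entirely.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Pattern:
    """Hypothetical stand-in for a query term (not part of Xcerpt)."""
    label: str
    children: list = field(default_factory=list)
    partial: bool = False  # True plays the role of the dotted border


def matches(pattern, element):
    """Return True if the data element matches the pattern (labels only)."""
    if pattern.label != element.tag:
        return False
    data_children = list(element)
    # A total pattern must account for every child of the data element.
    if not pattern.partial and len(pattern.children) != len(data_children):
        return False
    return _assign(pattern.children, data_children)


def _assign(patterns, candidates):
    """Try to match every pattern child against a distinct data child."""
    if not patterns:
        return True
    first, rest = patterns[0], patterns[1:]
    for i, candidate in enumerate(candidates):
        remaining = candidates[:i] + candidates[i + 1:]
        if matches(first, candidate) and _assign(rest, remaining):
            return True
    return False


# Heavily abbreviated, made-up stand-in for the mirrored Amazon response.
DATA = ET.fromstring(
    "<ItemLookupResponse><Items><Item/></Items></ItemLookupResponse>"
)

total_query = Pattern("ItemLookupResponse")
partial_query = Pattern("ItemLookupResponse", partial=True)

print(matches(total_query, DATA))    # False: the data has a child the query omits
print(matches(partial_query, DATA))  # True: extra children are permitted
```

The total pattern fails for the same reason the query in the previous step failed: the data element has children that the pattern does not mention.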
So let's execute the query again. Aha, something changed. Now we get a result element back.
But that's still a fairly boring answer. The only thing we get from that result is knowing that the data at the URI specified in the query indeed has ItemLookupResponse as top-level element. However, what we are interested in is the price of the item returned in the ItemLookupResponse!
Fourth step: Adding Structure
For that, we start by adding the ListPrice element to the query. It should occur inside the ItemLookupResponse, so let's paste it there.
However, now the query is overspecified. Selecting the price should not depend on the formatted price. Remember, whenever we omit a subelement from the data in the query, we have to make the parent term partial.
Also, we don't want to specify the price in the query (thus selecting only items with price 499,99$); instead, we want to extract the price. How do we extract the price? With a variable ...
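The contrast between a constant and a variable inside a partial ListPrice term can be sketched in Python as follows. This is an emulation under stated assumptions, not Xcerpt: the Var class and both helper functions are hypothetical, and the child element names (Amount, CurrencyCode, FormattedPrice) and all values are made up for illustration. The pattern is applied to a ListPrice element directly, leaving aside for now where that element sits in the document.

```python
import xml.etree.ElementTree as ET


class Var:
    """Hypothetical marker for a query variable."""
    def __init__(self, name):
        self.name = name


def match_leaf(expected, element, bindings):
    """Match a leaf's text against a constant, or bind it to a variable."""
    text = (element.text or "").strip()
    if isinstance(expected, Var):
        bindings[expected.name] = text  # a variable matches any text
        return True
    return expected == text             # a constant must be equal


def match_list_price(pattern, list_price):
    """Partial match: children not named in the pattern are ignored.

    Returns the variable bindings, or None if the pattern does not match.
    """
    bindings = {}
    for tag, expected in pattern.items():
        child = list_price.find(tag)  # direct children only
        if child is None or not match_leaf(expected, child, bindings):
            return None
    return bindings


# Made-up sample; element names and values are assumptions.
LIST_PRICE = ET.fromstring(
    "<ListPrice>"
    "<Amount>49999</Amount>"
    "<CurrencyCode>USD</CurrencyCode>"
    "<FormattedPrice>$499.99</FormattedPrice>"
    "</ListPrice>"
)

# Constant for the currency (a selection), variable for the amount (an extraction).
usd_query = {"CurrencyCode": "USD", "Amount": Var("Price")}
eur_query = {"CurrencyCode": "EUR", "Amount": Var("Price")}

print(match_list_price(usd_query, LIST_PRICE))  # {'Price': '49999'}
print(match_list_price(eur_query, LIST_PRICE))  # None
```

Because FormattedPrice is not mentioned in either pattern, it plays no role in the match, which is the effect of making the parent term partial.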
Looks good to you? Let's try it ... fails again ... why?
Fifth step: Descendant and Variables
There are intermediate elements between ItemLookupResponse and ListPrice, and we need to tell Xcerpt to ignore them. We do this by adding a descendant construct.
Let's try again. It works, but where is the price ... For that, we need to add the variable to the result. That's it. Now we are selecting exactly what we want!
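Taken together, the finished query checks the top-level element, looks for ListPrice at any depth, restricts the match to USD, and places the extracted price in the result. A minimal Python emulation of that behavior might look like the sketch below. It is not Xcerpt and makes several assumptions: the function names are hypothetical, the intermediate elements (Items, Item, ItemAttributes) and the ListPrice children are assumed, and the sample values are made up.

```python
import xml.etree.ElementTree as ET

# Made-up, abbreviated stand-in for the mirrored Amazon response.
DATA = """
<ItemLookupResponse>
  <Items>
    <Item>
      <ItemAttributes>
        <ListPrice>
          <Amount>49999</Amount>
          <CurrencyCode>USD</CurrencyCode>
          <FormattedPrice>$499.99</FormattedPrice>
        </ListPrice>
      </ItemAttributes>
    </Item>
  </Items>
</ItemLookupResponse>
"""


def direct_child_price(xml_text):
    """Without descendant: ListPrice must be a direct child of the root."""
    root = ET.fromstring(xml_text)
    return root.find("ListPrice")  # a bare tag name matches direct children only


def query_price(xml_text):
    """With descendant: ListPrice may occur at any depth below the root."""
    root = ET.fromstring(xml_text)
    if root.tag != "ItemLookupResponse":
        return None
    for list_price in root.iter("ListPrice"):  # searches the whole subtree
        if list_price.findtext("CurrencyCode") != "USD":
            continue
        amount = list_price.findtext("Amount")
        if amount is not None:
            # Construct the result and put the bound value into it.
            result = ET.Element("result")
            result.text = amount.strip()
            return result
    return None


print(direct_child_price(DATA))  # None: intermediate elements are in the way
print(ET.tostring(query_price(DATA), encoding="unicode"))  # <result>49999</result>
```

The first function corresponds to the failing query of the fourth step; the second skips the intermediate elements and also carries the extracted value into the result, which is what the last two changes to the query achieve.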
Summary
To summarize, we have seen how data is turned into a query and how we can exploit incompleteness constructs (partiality and descendant) to make a query focus only on what the query intent requires.
In further segments we will look at more expressive patterns (with negation and optional), at aggregation and grouping (which allow us to return multiple matches of the same query), and at rules in Xcerpt (which allow us to separate a query into small, focused, and easy-to-understand segments).