Reasoning-aware, format-versatile Web querying that makes querying as easy as creating data

Segment 1: Data On the Web


This segment is a beginner's introduction to Web data. It looks at how common data on the Web, the kind you interact with daily, is actually represented. This representation is all that Web applications have to work with to do the things possible nowadays: render the data in a browser, show it in a calendar, display it on a map, or merge and integrate it with other data to automatically find corresponding items.

HTML and HTML Aggregation

Let's start as many of you may do every day: go to a browser and enter the URI of some page you are interested in.
s1-c0-cnn

This is (more or less) what your browser shows you when browsing to cnn.com. To the browser, however, it looks quite different. You can find that out yourself by selecting View Source (or something similar) in your browser's menu. Let's look at that with a few more bells and whistles:

s1-c1-html



So what are we seeing here? Obviously there is the textual content of the Web site, but there is also a lot more. Compare, e.g., the headline and the body of the news item: one is enclosed in an h2 element, the other in a p element. The markers around them are called start and end tags, and together they form the boundaries of an element. Elements are the fundamental units of XML, the most common representation format on the Web. This particular Web site, like most of the Web sites you are familiar with, uses HTML or XHTML, which is (roughly speaking) a dialect of XML. Two things are important here: elements have labels (like p and h2) that indicate different types of information, and elements can be nested (see all the divs).
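
To make labels and nesting a bit more concrete, here is a minimal sketch in Python of how a program might walk through such markup. The snippet is invented for illustration and is not CNN's actual page source; the names SNIPPET, OutlineParser and outline are hypothetical. It only relies on the HTMLParser class from Python's standard library.

```python
from html.parser import HTMLParser

# Invented snippet in the spirit of a news page; NOT CNN's actual markup.
SNIPPET = """
<div>
  <div>
    <h2>Example headline</h2>
    <p>Example body text of the news item.</p>
  </div>
</div>
"""


class OutlineParser(HTMLParser):
    """Records each element's label together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, label) pairs

    def handle_starttag(self, tag, attrs):
        # A start tag opens an element one level below its parent.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # An end tag closes the current element.
        self.depth -= 1


def outline(html_text):
    """Return the (depth, label) pairs for all elements in html_text."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    for depth, label in outline(SNIPPET):
        print("  " * depth + label)
```

Printing the outline with indentation gives a rough idea of the hierarchy an application sees: two nested div elements with an h2 and a p inside. The sketch assumes every start tag has a matching end tag, which real-world HTML does not always guarantee.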

So why should you care? The short answer is this: knowing how Web applications see Web information allows you to create Web applications that combine and reuse information from different sources on the Web, as Google does in Google News.

RSS and Feed Readers

Before we get to the long answer to that question, bear with me a moment longer while I show you some more Web data. CNN, like most news organizations, also offers its news in RSS format. You can tell by an RSS link on the page or, if your browser supports it, by an icon such as the blue RSS icon here in Safari. Clicking on that brings up my browser's rendering of the RSS feed from CNN. RSS is used by many news readers to aggregate information from different news pages, blogs, event publishers, etc. You can read RSS in many browsers, in online aggregators such as Google Reader, or in desktop applications, e.g., NetNewsWire.

With RSS you can already see that there are many different ways to integrate and aggregate the plain data available on the Web. So let's again take a look at what applications such as NetNewsWire or Google Reader see when accessing RSS data. Even with syntax highlighting and indentation, it's still fairly tedious to find out more about the data in this way. So let's switch to a visualization application that we developed for arbitrary Web data.
s1-c2-rss-vis

Here we have opened another RSS feed, the one from the REWERSE Web site (enough advertisement for CNN already). The application shows the hierarchy in the data much more clearly than a text editor and lets you fold and unfold elements (remember, the stuff between tags like p and h2). We can do that for RSS and find, e.g., that the items in an RSS feed consist of title, link and data entries.
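
What a feed reader does with such items can be sketched in a few lines of Python. The feed below is invented (it is not the REWERSE feed) and assumes plain RSS 2.0 without namespaces; FEED and read_items are hypothetical names. Only the title and link entries mentioned above are read, using the standard library's xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 feed; titles and links are placeholders,
# NOT the real REWERSE or CNN feed.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example project news</title>
    <link>http://example.org/</link>
    <item>
      <title>First announcement</title>
      <link>http://example.org/news/1</link>
    </item>
    <item>
      <title>Second announcement</title>
      <link>http://example.org/news/2</link>
    </item>
  </channel>
</rss>
"""


def read_items(feed_text):
    """Return one dict per item element, holding its title and link."""
    root = ET.fromstring(feed_text)
    items = []
    # iter("item") visits every item element, wherever it is nested.
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(FEED):
        print(entry["title"], "->", entry["link"])
```

An aggregator would run something like read_items over many feeds and combine the resulting lists. Feeds in other RSS versions use XML namespaces, so the element names would need to be qualified there.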

This visualization works for HTML as well, e.g., for the project page of REWERSE.
s1-c3-html-vis

Here we see how the data on REWERSE is structured. Let's take a quick look at the browser rendering of REWERSE. You can easily see how different parts of the Web site are represented in the Web data, e.g., headlines, unordered lists (ul), etc.

However, you might notice that HTML only conveys information about the structure of a Web site, not about its content. Say I would like to find out that this Web site is about a project related to Web querying. How would I go about that? Unless "Web querying" occurs in the content of the data (and I'm willing to mess with natural language understanding, something we can't yet teach our computers very well), that's going to be hard or impossible with HTML alone.

This deficiency has led to the development of XML and RDF, part of the future (or Semantic) Web, where computers (and not just humans, as today) are supposed to be able not only to present data but also to understand it sufficiently to do clever automatic processing.

Amazon Product Information in XML


A first step in this direction is the use of more application-specific vocabularies, e.g., in XML instead of HTML pages. Amazon, for instance, offers a means (called a Web service) for accessing richer structured information on Amazon products than the Web pages offer.
s1-c4-amazon-vis

This is already much easier to use for an application such as Delicious Library, which uses information from Amazon to get at metadata about books (or products in general).
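
To illustrate why an application-specific vocabulary is easier to work with, here is a small Python sketch. The element names (Item, Title, Author, Price) and the data are made up for this example and are not Amazon's actual Web service format; PRODUCT and book_metadata are hypothetical names as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical product record. The vocabulary is invented for illustration
# and is NOT the format returned by Amazon's Web service.
PRODUCT = """
<Item>
  <Title>Example Book</Title>
  <Author>Jane Doe</Author>
  <Price currency="USD">19.99</Price>
</Item>
"""


def book_metadata(xml_text):
    """Pick the metadata out of the record by element label."""
    item = ET.fromstring(xml_text)
    price = item.find("Price")
    return {
        "title": item.findtext("Title"),
        "author": item.findtext("Author"),
        "price": float(price.text),
        "currency": price.get("currency"),
    }


if __name__ == "__main__":
    print(book_metadata(PRODUCT))
```

Because the labels say what the content is (a title, an author, a price), the application can ask for these things directly. With an HTML page it would have to guess which h2 or p happens to hold the price.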

Social Network Information in RDF

The next step, in the view of the W3C, the standards organization behind the Web, is data represented in RDF. RSS is actually (somewhat of) an example of such data. Another frequently used example is social networking data, e.g., at FLINK. Let me show you the FLINK data on Francois Bry, REWERSE's scientific coordinator, showing a small excerpt of his professional network from past ISWC conferences.
s1-c5-foaf-vis

Now, this data is fairly easy to process automatically and, in particular, to merge and aggregate from different sources. It might be used, e.g., to enhance conventional Web sites, to extend searches for Francois' papers to papers by people professionally close to him, or, as FLINK does, simply to visualize networks of researchers.
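
The reason merging is so easy can be sketched in Python by treating RDF data as a set of (subject, predicate, object) triples. This is a simplification, not an RDF library, and it is not how FLINK is implemented. The predicates foaf:knows and foaf:name come from the FOAF vocabulary; all identifiers (ex:bry, ex:alice, ex:bob, ex:carol) and the relationships between them are invented for illustration.

```python
# Triples are (subject, predicate, object). All identifiers and the
# relationships below are invented placeholders, NOT actual FLINK data.
KNOWS = "foaf:knows"
NAME = "foaf:name"

source_a = {
    ("ex:bry", NAME, "Francois Bry"),
    ("ex:bry", KNOWS, "ex:alice"),
}

source_b = {
    ("ex:bry", KNOWS, "ex:bob"),
    ("ex:bry", KNOWS, "ex:alice"),   # also stated in source_a
    ("ex:alice", KNOWS, "ex:carol"),
}


def merge(*graphs):
    """Merging graphs is a set union: duplicate triples collapse."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def acquaintances(graph, person):
    """Everyone the given person is stated to know, sorted by identifier."""
    return sorted(o for s, p, o in graph if s == person and p == KNOWS)


if __name__ == "__main__":
    merged = merge(source_a, source_b)
    print(len(merged), "triples after merging")
    print(acquaintances(merged, "ex:bry"))
```

Since both sources use the same identifier for the same person, the union just works: statements from either source about ex:bry end up attached to the same node of the graph.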

Summary

To summarize, we have taken a brief look at how Web data looks to a Web application: a hierarchy or graph of elements labeled with a URI or a simple tag. We have seen that different data formats, such as XML, RSS, HTML, and RDF, are available on the Web. Finally, we have seen that HTML isn't the best format for automatic processing without natural language understanding, but that other Web formats make such processing fairly easy. They transform the Web from a database of information merely for humans into a database that is also usable by computers, offering exciting possibilities for rich automatic data processing.

From this segment, you can continue learning about the idea of Web querying or immediately jump into application samples with Xcerpt, the Web query language developed in the REWERSE project.