Screen Scraping

“Screen scraping” generally refers to a software technique for extracting data from an HTML page without any linguistic analysis of its content.

XPath

Curiosity performs screen scraping by means of XPath. In short, XPath is a query language for XML documents. For example, given some XML, you can ask “give me all div nodes whose class attribute is title”, which is expressed by the following XPath:

//div[@class = 'title']
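
As a quick illustration outside Curiosity, the same query can be tried with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The sample document and the variable names are invented for this example, and ElementTree needs the relative form .//div rather than //div.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = """
<html>
  <body>
    <div class="title">First headline</div>
    <div class="body">Some text</div>
    <div class="title">Second headline</div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# ElementTree only accepts relative paths, hence ".//div" instead of "//div".
titles = [node.text for node in root.findall(".//div[@class='title']")]
print(titles)
```

Only the two div nodes whose class attribute is title are selected; the div with class body is skipped.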

A good XPath Quick Reference can be found here.

Tidy

However, not every HTML file is a well-formed XML file. Curiosity converts HTML files to XML by means of Tidy, an open source program that is bundled with Curiosity.

Tidy has a number of options that can be configured through an external file. Curiosity has a default Tidy configuration file (named default.tidy) which should work for most web sites. If you have problems running Curiosity on a particular web site, you can try tuning the Tidy options for that web source: create your own file, named for example a_web_source.tidy, and refer to it in the configuration of the web source:

<webSource name="aSource">
    <tidyConfig>a_web_source.tidy</tidyConfig>
</webSource>

Curiosity Studio

So, you now have the XML version of a given web site, and you can engineer all the XPaths that you need. For example, given a news site, you typically want XPaths for retrieving news titles, dates, abstracts, authors and possibly attached images.

In order to help you in the task of creating xpaths for your sites, Curiosity is distributed with a visual tool named Curiosity Studio which:

  1. displays the document as Internet Explorer would render it
  2. lets you select the relevant information (e.g. a news title)
  3. finds out the corresponding node in the xml structure
  4. displays the corresponding XPath

Curiosity Studio can be executed by running the CuriosityStudio.exe executable.

In the Quick Start tour, you can follow a step-by-step demonstration of Curiosity Studio, and see some screenshots too.

You can select several titles in your news page, look at their xpaths, and then generalize in order to find out the xpath that will retrieve all the intended nodes.

In many cases, it will be enough to select just the first title, as its xpath will already be your target.

Moreover, you should pay attention to the CSS 'class' attribute of the nodes: it is often enough to identify all the nodes of a given kind.

An XPath returns an XML node. You must then choose which property of the node will be kept, and you have four options:

  • one of the HTML attributes of the node (e.g. href in the case of links)
  • the inner text of the node
  • the innerxml, that is the HTML code of all elements placed inside the node
  • the outerxml, that is the innerxml plus the HTML code of the node itself
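
The four options can be illustrated with a small Python sketch based on the standard xml.etree.ElementTree module. It is not Curiosity code: the sample node and the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical node, as an XPath query might return it.
node = ET.fromstring('<a href="/news/1.html">Read <b>more</b></a>')

# 1. An HTML attribute of the node.
attribute = node.get("href")

# 2. The inner text: the text of the node and of all its descendants.
inner_text = "".join(node.itertext())

# 3. The inner XML: the node's leading text plus the markup of each child.
inner_xml = (node.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in node
)

# 4. The outer XML: the markup of the node itself, content included.
outer_xml = ET.tostring(node, encoding="unicode")
```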

Moreover, you can optionally use a regular expression in order to select a specific part of the extracted text.

Information Slots

The combination of XPath, attribute and regular expression forms an information slot. Each slot must have a name, and is encoded in curiosity.xml as follows:

<slot name="Title">
    <xpath>/html/body/table/tr/td/a/img</xpath>
    <attr>src</attr>
    <regexp>
        <exp>.*photo_archive/(.*)-thu.*$</exp>
        <subst>$+</subst>
    </regexp>
</slot>
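
To see what the regexp part does, here is a minimal Python sketch that applies the same expression to a made-up src value. It assumes that "$+" follows the .NET substitution convention (the last captured group); this is an assumption, not something stated by the Curiosity documentation. The URL is hypothetical.

```python
import re

# Pattern taken from the slot above.
EXP = r".*photo_archive/(.*)-thu.*$"

# Hypothetical value of the src attribute, invented for this example.
src = "http://www.example.com/photo_archive/12345-thumb.jpg"

# Assumption: "$+" means "the last captured group", as in .NET regular
# expressions; with a single group, Python's r"\1" plays the same role.
value = re.sub(EXP, r"\1", src)
print(value)
```

Under that assumption, the slot keeps only the part of the image URL between photo_archive/ and -thu.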

Note that Curiosity Studio allows you to specify just the slot name and XPath; if you need the other features, you have to add them manually to curiosity.xml.

For every webSource, all the information slots must be placed under a slots tag:

<webSource name="aSource">
    <slots level="0">
        <slot name="Title">
            <xpath>/html/body/table/tr/td/a/img</xpath>
            <attr>src</attr>
            <regexp>
                <exp>.*archiviophoto/(.*)-thu.*$</exp>
                <subst>$+</subst>
            </regexp>
        </slot>
        <slot name="Abstract">
            <xpath>/html/body/table/tr/td/span</xpath>
        </slot>
    </slots>
</webSource>

Note that if you leave the attr value unspecified, the inner text will be used.

Primary Keys

If you want the Notification Engine to detect modified and deleted items as well, you must have a slot whose value uniquely identifies an item: we will refer to such a slot as a “primary key”.

You must also set the web source fullDiff attribute to true.

<webSource name="aSource" fullDiff="true">
    <slots level="0">
        <slot name="TitleLink" isKey="true">
            ...
        </slot>
        ...
    </slots>
</webSource>
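
To see why a key is needed, consider how two successive extractions could be compared. The following Python sketch only illustrates the idea and is not the actual Notification Engine; the function diff_by_key and the sample data are hypothetical.

```python
def diff_by_key(previous, current, key):
    """Compare two lists of slot dictionaries, using `key` as primary key."""
    old = {item[key]: item for item in previous}
    new = {item[key]: item for item in current}
    # Keys seen only in the current run are new items.
    added = [new[k] for k in new if k not in old]
    # Keys seen only in the previous run are deleted items.
    deleted = [old[k] for k in old if k not in new]
    # Keys seen in both runs with different slot values are modified items.
    modified = [new[k] for k in new if k in old and new[k] != old[k]]
    return added, modified, deleted


previous = [
    {"TitleLink": "/news/1", "Title": "Old title"},
    {"TitleLink": "/news/2", "Title": "Second"},
]
current = [
    {"TitleLink": "/news/1", "Title": "New title"},
    {"TitleLink": "/news/3", "Title": "Third"},
]

added, modified, deleted = diff_by_key(previous, current, "TitleLink")
```

Without a unique key, the item whose title changed could not be told apart from one deleted item plus one new item.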

You may want to use one of the default slots added by Curiosity as the primary key:

<webSource name="aSource" fullDiff="true">
    <slots level="0" key="sourceUrl">
        ...
    </slots>
</webSource>

Pivots

By default, the values extracted for the slots are grouped by position: for example, the first image is grouped with the first person name, and so on.

If a slot has fewer values in a page than the other slots, values are likely to be grouped wrongly.

So, if some slots are optional in a web source, you can try setting the usePrefix attribute of the slots tag to true:

<webSource name="aSource" fullDiff="true">
    <slots usePrefix="True">
        ...
    </slots>
</webSource>

Curiosity will try to extract a “pivot” from all the XPaths, that is, an XPath that is the longest common prefix of all of them. If the pivot exists (you may want to define your XPaths in such a way that Curiosity can actually extract one), Curiosity will first query the page for all pivot nodes, and then extract the slots starting from those nodes. This way, missing slot values will be properly detected.
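
The difference between the two grouping policies can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration of the idea only, not Curiosity's implementation; the sample page and the variable names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the second item has no image.
PAGE = """
<body>
  <div class="item"><img src="a.jpg"/><span>Alice</span></div>
  <div class="item"><span>Bob</span></div>
  <div class="item"><img src="c.jpg"/><span>Carol</span></div>
</body>
"""
root = ET.fromstring(PAGE)

# Position-based grouping: each slot is queried on the whole page and the
# results are paired by index, so "c.jpg" ends up paired with Bob.
images = [n.get("src") for n in root.findall("./div/img")]
names = [n.text for n in root.findall("./div/span")]
by_position = list(zip(images, names))

# Pivot-based grouping: query the common prefix first, then extract each
# slot relative to its pivot node, so a missing value stays visible as None.
by_pivot = []
for pivot in root.findall("./div"):
    img = pivot.find("./img")
    span = pivot.find("./span")
    by_pivot.append((
        img.get("src") if img is not None else None,
        span.text if span is not None else None,
    ))
```

Here ./div plays the role of the pivot, being the longest common prefix of ./div/img and ./div/span.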

History Management

Curiosity normally keeps the history synchronized with the web source: if a slots instance (e.g. a single news item) is deleted from the web source, it is also deleted from the history.

However, you may want slots instances to be retained in the history even if, at a given execution, they have been detected as removed. To do so, set the historyKeepAll attribute of the web source to true:

<webSource name="aSource" historyKeepAll="true">
    ...
</webSource>

For example, this can be useful if deleted instances are likely to be listed again in the future and you do not want them marked as new every time.
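
The effect can be pictured with a small Python sketch. It models the history as a plain set of item keys, which is a simplifying assumption; update_history and the sample keys are hypothetical and do not reflect how Curiosity stores its history.

```python
def update_history(history, current_keys, keep_all=False):
    """Return the new history (a set of item keys) after one execution."""
    if keep_all:
        # Retain everything ever seen, including items now removed.
        return history | current_keys
    # Default: mirror the web source, dropping removed items.
    return set(current_keys)


history = {"news-1", "news-2"}
run1 = {"news-1"}             # news-2 disappears from the source
run2 = {"news-1", "news-2"}   # news-2 is listed again

# Default behaviour: news-2 is forgotten, so it looks new again in run2.
default_history = update_history(history, run1)
new_by_default = run2 - default_history

# With keep_all: news-2 is still in the history, so nothing is new.
kept_history = update_history(history, run1, keep_all=True)
new_with_keep_all = run2 - kept_history
```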

The historyKeepAll attribute can be applied to other sources too (e.g. to the file system source).
