“Screen scraping” generally refers to a software technique for extracting data from an HTML page without any linguistic analysis of its content.
XPath
Curiosity performs screen scraping by means of XPath. In short, XPath is a query language for XML documents: for example, given some XML, you can ask for “all div nodes whose class attribute is title”, which is written in XPath as:
//div[@class = 'title']
A good XPath Quick Reference can be found here.
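As an illustration only, the query above can be tried outside Curiosity. This sketch uses Python's standard xml.etree.ElementTree module, which supports a limited XPath subset; the page fragment is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, already well-formed XML.
PAGE = (
    "<html><body>"
    "<div class='title'>First headline</div>"
    "<div class='summary'>Some abstract</div>"
    "<div class='title'>Second headline</div>"
    "</body></html>"
)

root = ET.fromstring(PAGE)

# ElementTree searches relative to an element, so the query is written
# with a leading "." instead of starting directly with "//".
titles = [div.text for div in root.findall(".//div[@class='title']")]
print(titles)  # ['First headline', 'Second headline']
```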
However, not every HTML file is a well-formed XML file. Curiosity can convert HTML files into XML by means of Tidy, an open source program bundled with Curiosity.
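To see why the conversion step is needed, the following sketch feeds two hypothetical snippets to a strict XML parser (Python's xml.etree.ElementTree). It does not invoke Tidy; it only shows that markup a browser accepts can be rejected as XML.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical browser-tolerated HTML: <br> and <p> are never closed.
html_snippet = "<html><body><p>First line<br>Second line</body></html>"

# The same content with every element closed, as a cleanup tool would emit it.
xml_snippet = "<html><body><p>First line<br/>Second line</p></body></html>"

print(is_well_formed_xml(html_snippet))  # False
print(is_well_formed_xml(xml_snippet))   # True
```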
Tidy has a number of options that can be configured through an external file. Curiosity ships with a default Tidy configuration file (named default.tidy), which should work for most web sites. If you have problems running Curiosity on a particular web site, you can tune the Tidy options for that web source: create your own file, e.g. a_web_source.tidy, and refer to it in the configuration of the web source:
<webSource
name="aSource">
<tidyConfig>a_web_source.tidy</tidyConfig>
</webSource>
Once you have the XML version of a given web site, you can engineer all the xpaths you need. For example, given a news site, you typically want an xpath for each of the news titles, dates, abstracts, authors and possibly attached images.
In order to help you in the task of creating xpaths for your sites, Curiosity is distributed with a visual tool named Curiosity Studio which:
Curiosity Studio can be executed by running the CuriosityStudio.exe executable.
In the Quick Start tour, you can follow a step-by-step demonstration of Curiosity Studio, and see some screenshots too.
You can select several titles in your news page, look at their xpaths, and then generalize in order to find out the xpath that will retrieve all the intended nodes.
In many cases, it will be enough to select just the first title, as its xpath will already be your target.
Moreover, you should pay attention to the css 'class' attribute of the nodes: in many cases it is enough to identify all the nodes of a given kind.
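The generalization step can be sketched as follows. The page and the xpaths are hypothetical and the queries use Python's xml.etree.ElementTree rather than Curiosity Studio: a path that pins one title is relaxed by dropping the positional index, or replaced by a query on the class attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical news page with three items of the same shape.
PAGE = (
    "<html><body>"
    "<div><h2 class='title'>First</h2><p>text</p></div>"
    "<div><h2 class='title'>Second</h2><p>text</p></div>"
    "<div><h2 class='title'>Third</h2><p>text</p></div>"
    "</body></html>"
)
root = ET.fromstring(PAGE)


def texts(xpath):
    """Return the text of every node matched by the xpath."""
    return [node.text for node in root.findall(xpath)]


# Path of one selected title: the index [1] pins it to the first div.
first_only = texts("./body/div[1]/h2")

# Generalized by dropping the index: every div/h2 pair now matches.
same_shape = texts("./body/div/h2")

# Generalized through the class attribute, independent of the nesting.
by_class = texts(".//h2[@class='title']")

print(first_only)  # ['First']
print(same_shape)  # ['First', 'Second', 'Third']
print(by_class)    # ['First', 'Second', 'Third']
```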
An xpath will return an xml node. Then, you must choose which attribute of the node will be kept, and you have four options:
Moreover, you can optionally use a regular expression in order to select a specific part of the extracted text.
Information Slots
The combination of xpath, attribute and regular expression makes an information slot: it must have a name, and is encoded in curiosity.xml in the following way:
<slot
name="Title">
<xpath>/html/body/table/tr/td/a/img</xpath>
<attr>src</attr>
<regexp>
<exp>.*photo_archive/(.*)-thu.*$</exp>
<subst>$+</subst>
</regexp>
</slot>
Note that Curiosity Studio allows you to specify only the slot name and xpath; if you need the other features, you have to add them manually to curiosity.xml.
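A rough sketch of how such a slot could be evaluated is given below. It is not Curiosity's implementation: the function evaluate_slot and the sample page are hypothetical, and the subst value $+ is assumed to mean "the last captured group", as in .NET substitution syntax. The xpath is written relative to the root element because Python's ElementTree does not accept absolute paths on an element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page matching the shape of the slot's xpath.
PAGE = (
    "<html><body><table>"
    "<tr><td><a href='/p/1'><img src='/photo_archive/sunset-thu.jpg'/></a></td>"
    "<td><span>Short text</span></td></tr>"
    "<tr><td><a href='/p/2'><img src='/photo_archive/harbour-thu.jpg'/></a></td>"
    "<td><span>Other text</span></td></tr>"
    "</table></body></html>"
)


def evaluate_slot(root, xpath, attr=None, exp=None):
    """Apply xpath, then attribute selection, then an optional regular expression."""
    values = []
    for node in root.findall(xpath):
        # Without an attribute name, fall back to the node's inner text.
        raw = node.get(attr) if attr else "".join(node.itertext())
        if raw is None:
            continue  # the requested attribute is missing on this node
        if exp:
            # Assumption: "$+" stands for the last captured group.
            raw = re.sub(exp, lambda m: m.group(m.lastindex or 0), raw)
        values.append(raw)
    return values


root = ET.fromstring(PAGE)
names = evaluate_slot(
    root,
    "./body/table/tr/td/a/img",  # relative form of /html/body/table/tr/td/a/img
    attr="src",
    exp=r".*photo_archive/(.*)-thu.*$",
)
abstracts = evaluate_slot(root, "./body/table/tr/td/span")

print(names)      # ['sunset', 'harbour']
print(abstracts)  # ['Short text', 'Other text']
```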
For every webSource, all the information slots must be placed under a slots tag.
<webSource
name="aSource">
<slots
level="0">
<slot
name="Title">
<xpath>/html/body/table/tr/td/a/img</xpath>
<attr>src</attr>
<regexp>
<exp>.*archiviophoto/(.*)-thu.*$</exp>
<subst>$+</subst>
</regexp>
</slot>
<slot
name="Abstract">
<xpath>/html/body/table/tr/td/span</xpath>
</slot>
</slots>
</webSource>
Note that if you leave the attr value unspecified, the inner text will be used.
Primary Keys
If you want the Notification Engine to be able to detect modified and deleted items as well, you must have a slot whose value uniquely identifies an item: we will refer to such a slot as a “primary key”.
You must also set the web source fullDiff attribute to true.
<webSource name="aSource"
fullDiff="true">
<slots level="0">
<slot
name="TitleLink"
isKey="true">
...
</slot>
...
</slots>
</webSource>
You may want to use one of the default slots added by Curiosity as the primary key:
<webSource name="aSource"
fullDiff="true">
<slots level="0"
key="sourceUrl">
...
</slots>
</webSource>
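The reason a primary key is needed can be shown with a small sketch. This is not the Notification Engine's actual algorithm; diff_by_key and the sample items are hypothetical. Without a stable key, a modified item cannot be told apart from one item being deleted and another being added.

```python
def diff_by_key(previous, current, key):
    """Compare two extraction runs, matching items on the primary key slot."""
    old = {item[key]: item for item in previous}
    new = {item[key]: item for item in current}
    added = [new[k] for k in new if k not in old]
    deleted = [old[k] for k in old if k not in new]
    # Same key on both sides but different slot values: the item was modified.
    modified = [new[k] for k in new if k in old and new[k] != old[k]]
    return added, modified, deleted


# Hypothetical slot values from two consecutive runs.
previous_run = [
    {"TitleLink": "/news/1", "Title": "Budget approved"},
    {"TitleLink": "/news/2", "Title": "Road closed"},
]
current_run = [
    {"TitleLink": "/news/1", "Title": "Budget approved (updated)"},
    {"TitleLink": "/news/3", "Title": "New library opens"},
]

added, modified, deleted = diff_by_key(previous_run, current_run, "TitleLink")
print(added)     # the /news/3 item
print(modified)  # the /news/1 item with its new title
print(deleted)   # the /news/2 item
```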
Pivots
By default, the extracted values for every slot are grouped by a position-based policy: e.g. the first image is grouped with the first person name, and so on.
If a slot has fewer values in a page than the other slots, values will likely be grouped wrongly.
So, if some slots are optional in a web source, you can try setting the usePrefix attribute of the slots tag to true:
<webSource name="aSource"
fullDiff="true">
<slots
usePrefix="true">
...
</slots>
</webSource>
Curiosity will try to extract a “pivot” from all the xpaths, that is, it will check whether an xpath exists that is the longest common prefix of all of them. If the pivot exists (you may want to define xpaths in such a way that Curiosity can actually extract the pivot), Curiosity will first query the page for all pivot nodes, and then extract the slots starting from those nodes. This way, missing slots will be properly detected.
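The difference between the two grouping policies can be sketched as follows. This is a simplified reading of the description above, not Curiosity's code: the helper names and the page are hypothetical, and the common prefix is computed by naively splitting the xpaths on "/", which would not cope with predicates that contain a slash.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the second person has no photo.
PAGE = (
    "<html><body><ul>"
    "<li><b>Alice</b><i>alice.png</i></li>"
    "<li><b>Bob</b></li>"
    "<li><b>Carol</b><i>carol.png</i></li>"
    "</ul></body></html>"
)
SLOTS = {"Name": "./body/ul/li/b", "Photo": "./body/ul/li/i"}


def common_prefix(xpaths):
    """Longest run of leading path steps shared by all xpaths."""
    prefix = []
    for steps in zip(*(path.split("/") for path in xpaths)):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return "/".join(prefix)


def group_by_position(root, slots):
    """Pair the i-th value of every slot, padding short slots with None."""
    columns = {name: [n.text for n in root.findall(path)] for name, path in slots.items()}
    size = max(len(values) for values in columns.values())
    return [
        {name: (values[i] if i < len(values) else None) for name, values in columns.items()}
        for i in range(size)
    ]


def group_by_pivot(root, slots):
    """Query the pivot nodes first, then extract each slot below its pivot node."""
    pivot = common_prefix(list(slots.values()))
    items = []
    for node in root.findall(pivot):
        item = {}
        for name, path in slots.items():
            found = node.find("." + path[len(pivot):])  # path relative to the pivot
            item[name] = found.text if found is not None else None
        items.append(item)
    return items


root = ET.fromstring(PAGE)
print(common_prefix(list(SLOTS.values())))  # ./body/ul/li
print(group_by_position(root, SLOTS))       # Bob is wrongly paired with carol.png
print(group_by_pivot(root, SLOTS))          # Bob's missing photo is reported as None
```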
Curiosity will normally keep the history synchronized with the web source: if a slots instance (e.g. a single news item) is deleted from the web source, then it is also deleted from the history.
However, you may want slots instances to be retained in the history even if, at a given execution, they have been detected as removed:
<webSource name="aSource"
historyKeepAll="true">
...
</webSource>
For example, this can be useful if deleted instances may be listed again in the future and you do not want them marked as new every time.
The historyKeepAll attribute can be applied to other sources too (e.g. to the file system source).
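The effect of historyKeepAll can be illustrated with a small simulation. This is only a model of the behaviour described above, with hypothetical function names and data, in which the history is a dictionary keyed by primary key.

```python
def update_history(history, current, keep_all=False):
    """Return the history after a run; both arguments are dicts keyed by primary key."""
    if keep_all:
        merged = dict(history)  # keep instances that disappeared from the source
        merged.update(current)
        return merged
    return dict(current)        # mirror the source: removed instances are dropped


def new_items_per_run(snapshots, keep_all):
    """For each run, list the keys that would be reported as new."""
    history = {}
    report = []
    for snapshot in snapshots:
        report.append(sorted(key for key in snapshot if key not in history))
        history = update_history(history, snapshot, keep_all)
    return report


# Item "b" disappears in the second run and is listed again in the third.
SNAPSHOTS = [{"a": 1, "b": 2}, {"a": 1}, {"a": 1, "b": 2}]

print(new_items_per_run(SNAPSHOTS, keep_all=False))  # [['a', 'b'], [], ['b']]
print(new_items_per_run(SNAPSHOTS, keep_all=True))   # [['a', 'b'], [], []]
```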