The Web source is the main kind of source from which Curiosity can extract information. It is an HTML page (or, more generally, a web site) from which data is extracted by means of screen scraping.
A web source can be defined in curiosity.xml as follows:
<curiosity>
  <sources>
    <webSource
        name="sourceName"
        urlSource="http://...">
      ...
    </webSource>
  </sources>
</curiosity>
The name attribute (sourceName in the example above) is mandatory: its value must be unique among all the sources, and it must not contain spaces or any of the following characters: \, /, :, *, ?, |, ", >, <.
The urlSource attribute is optional: omit it when you are defining a source that holds generic scraping rules (such as Microformats) and that can be used as a cloner for cloned sources.
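As a rough sketch of such a generic source, the declaration below keeps the mandatory name and simply leaves out urlSource. The name microformatsRules is a hypothetical placeholder, and the scraping rules are elided exactly as in the example above.

```xml
<curiosity>
  <sources>
    <!-- Generic source: no urlSource, intended to be used as a cloner.
         "microformatsRules" is a hypothetical name. -->
    <webSource
        name="microformatsRules">
      ...
    </webSource>
  </sources>
</curiosity>
```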
You can also define a label attribute to better describe the source (this applies to every type of source); the label can later be used during the providing actions, typically by accessing it in the XSL transformation.
If Curiosity fails to detect the character encoding of a document, you can define it explicitly with the charset attribute of the webSource tag (e.g. charset="utf-8").
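For illustration, a web source that sets both optional attributes might look like the sketch below. The name peopleDirectory and the label text are hypothetical values; the URL is left elided as in the first example.

```xml
<curiosity>
  <sources>
    <!-- "peopleDirectory" and the label text are hypothetical values. -->
    <webSource
        name="peopleDirectory"
        label="People directory"
        charset="utf-8"
        urlSource="http://...">
      ...
    </webSource>
  </sources>
</curiosity>
```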
If a web source is defined so that it can be applied to a whole set of similar HTML pages, and there is a URL "pattern" that selects those pages, you can state that pattern with the pattern attribute of the webSource tag. The pattern is used when the web service getMatchingSources(String aUrl) is invoked. Currently, a pattern matches a URL if and only if the URL starts with that pattern: for instance, a source with pattern="http://www.example.com/people" matches URLs such as http://www.example.com/people/sales/steven_jobz.htm.
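A minimal sketch of a source carrying the pattern from the example could be written as follows. The name peoplePages is a hypothetical placeholder, and the urlSource attribute is simply carried over from the first example; whether it is needed alongside pattern is an assumption.

```xml
<curiosity>
  <sources>
    <!-- Matches any URL that starts with the pattern value.
         "peoplePages" is a hypothetical name. -->
    <webSource
        name="peoplePages"
        pattern="http://www.example.com/people"
        urlSource="http://...">
      ...
    </webSource>
  </sources>
</curiosity>
```

Given the starts-with rule, this source would be matched for http://www.example.com/people/sales/steven_jobz.htm but not for a URL such as http://www.example.com/products/, which does not begin with the pattern.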
Web sources can also be managed with Curiosity Studio, which lets you add, edit and delete web sources.
Now you need to learn about:
Or you may want to learn about the other available sources.