The Curiosity crawler can browse a web site in a fully customizable way, because it is built on the same techniques as the screen scraper.
For each page retrieved at a given level of the crawl, Curiosity decides which links to follow from that page by applying XPath expressions to it.
In Curiosity, a crawl consists of steps, numbered from 0. When Curiosity accesses the urlSource of a web source, it is executing step 0 of the crawl.
If you want to crawl a web site, the information you need is presumably spread across a number of pages, and not (only) on the starting one. So you must tell Curiosity where that information is located, and (at least for now) you do this by using levels. You can state that the information must be extracted from all the pages retrieved at a specific level, or that this must happen at every level of the crawl.
So, when the information is not on the starting page, the slots node of the web source should have a level value strictly greater than 0, as in the following example:
<webSource name="aSource">
  <nextstepsList>
    <slot level="0">
      <xpath>//a[@class = 'Links']</xpath>
    </slot>
  </nextstepsList>
  <slots level="1">
    <slot name="Title">
      ...
    </slot>
  </slots>
</webSource>
After retrieving the starting page, Curiosity looks among the children of the nextstepsList node of the web source, which is expected to contain a list of slots, each with an associated level attribute. To obtain the next-step links from the current page, only the slots whose level is 0 are chosen.
Step 0 ends with a set of URLs and, if you configured extraction at every level, with some extracted information.
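Then step 1 begins, and the algorithm is repeated. In general, at level n, given the set of URLs retrieved at level n - 1, Curiosity handles each URL in the same way as the starting page: it retrieves the page, applies the next-step slots whose level is n to collect the links for the following step, and extracts information if the slots node applies to level n.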
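The level-by-level loop can be illustrated with a small Python sketch. It is not Curiosity's implementation: the names PAGES, NEXT_STEPS, SLOTS, applies and crawl are hypothetical, an in-memory dictionary stands in for HTTP retrieval, and skipping already visited URLs is an assumption the text above does not state. Python's ElementTree supports only a subset of XPath and needs relative paths, so the expressions are written as `.//a[...]` rather than `//a[...]`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical in-memory site (URL -> well-formed XHTML) standing in for HTTP fetches.
PAGES = {
    "http://example.com/": "<html><body>"
    "<a class='Links' href='a.html'>A</a>"
    "<a class='Links' href='b.html'>B</a>"
    "<a href='skip.html'>not followed</a>"
    "</body></html>",
    "http://example.com/a.html": "<html><head><title>Page A</title></head><body/></html>",
    "http://example.com/b.html": "<html><head><title>Page B</title></head><body/></html>",
}

# Mirrors the configuration above: follow 'Links' anchors at level 0,
# extract the Title slot at level 1. A level of None means "every level".
NEXT_STEPS = [{"level": 0, "xpath": ".//a[@class='Links']"}]
SLOTS = {"level": 1, "fields": {"Title": ".//title"}}


def applies(rule_level, current_level):
    """A rule with no level applies everywhere; otherwise levels must match."""
    return rule_level is None or rule_level == current_level


def crawl(start_url, pages, next_steps, slots):
    """Return the records extracted while crawling level by level."""
    records = []
    seen = {start_url}  # assumption: each URL is visited at most once
    frontier = [start_url]
    level = 0
    while frontier:
        found = []
        for url in frontier:
            root = ET.fromstring(pages[url])
            # Extract information if the slots node applies to this level.
            if applies(slots["level"], level):
                record = {"url": url}
                for name, xpath in slots["fields"].items():
                    node = root.find(xpath)
                    record[name] = node.text if node is not None else None
                records.append(record)
            # Collect the links for the next step from the matching rules.
            for rule in next_steps:
                if not applies(rule["level"], level):
                    continue
                for link in root.findall(rule["xpath"]):
                    href = link.get("href")
                    if href is None:
                        continue
                    target = urljoin(url, href)
                    if target in pages and target not in seen:
                        seen.add(target)
                        found.append(target)
        frontier = found
        level += 1
    return records
```

Calling `crawl("http://example.com/", PAGES, NEXT_STEPS, SLOTS)` visits the starting page at level 0 without extracting anything, follows the two anchors of class Links, and extracts a Title from each page reached at level 1. The crawl stops there because no next-step rule applies at level 1.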
You can also use attribute selection and regular expressions in the slots that extract the next-step links, as in the following example:
<webSource name="aSource">
  <nextstepsList>
    <slot level="0">
      <xpath>//a[@class = 'Links']</xpath>
    </slot>
    <slot level="1">
      <xpath>//div/table/tr[4]/td/a</xpath>
      <attr>onclick</attr>
      <regexp>
        <exp>.*"(news.asp\?ID=[0-9]+)".*$</exp>
        <subst>$+</subst>
      </regexp>
    </slot>
  </nextstepsList>
  <slots level="2">
    <slot name="Title">
      ...
    </slot>
  </slots>
</webSource>
This way you can handle, to some extent, links that the browser creates by means of JavaScript code.
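To see what the regular expression in the level 1 slot does to an onclick value, here is a minimal Python sketch. The function name next_step_url and the sample attribute values are hypothetical. The sketch assumes that `$+` stands for the last captured group (as it does in Perl and .NET) and that the resulting relative URL is resolved against the page it was found on. Neither behaviour is confirmed by the text above.

```python
import re
from urllib.parse import urljoin

# Same expression as in the <exp> element of the level 1 slot.
EXP = re.compile(r'.*"(news.asp\?ID=[0-9]+)".*$')


def next_step_url(page_url, attr_value):
    """Return the absolute URL found in an attribute value, or None if
    the expression does not match (assumed behaviour, not documented)."""
    match = EXP.search(attr_value)
    if match is None:
        return None
    # Assumption: "$+" means "the last captured group".
    relative = match.group(match.lastindex)
    # Assumption: relative links are resolved against the current page.
    return urljoin(page_url, relative)
```

For an anchor whose onclick attribute is `window.open("news.asp?ID=42", "_blank")`, found on a page at `http://example.com/site/index.asp`, the sketch yields `http://example.com/site/news.asp?ID=42`.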
If you just want to follow all the links in all the web pages of a site, you can simply use a configuration like the following:
<webSource name="aSource">
  <nextstepsList>
    <slot>
      <xpath>//a</xpath>
    </slot>
  </nextstepsList>
  <slots>
    <slot name="...">
      ...
    </slot>
  </slots>
</webSource>
Please note that in Curiosity Studio you can only set a next-step slot for level 0. If you want to specify more sophisticated rules, you must edit the curiosity.xml file by hand (the Studio will retain your changes).
By default, Curiosity will not crawl outside the starting URL; you can change this behaviour by setting the outside attribute of the nextstepsList tag to true:
<webSource name="aSource">
  <nextstepsList outside="true">
    <slot>
      <xpath>//a</xpath>
    </slot>
  </nextstepsList>
  ...
</webSource>
In this case it can be useful to set a maximum level with the maxLevel attribute:
<webSource name="aSource">
  <nextstepsList maxLevel="10" outside="true">
    <slot>
      <xpath>//a</xpath>
    </slot>
  </nextstepsList>
  ...
</webSource>
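The combined effect of the outside and maxLevel attributes can be pictured as a filter applied to every link before it is queued. The Python sketch below is an illustration under two assumptions that the text above does not settle: "outside the starting URL" is read as "on a different host", and maxLevel is treated as inclusive. The function name should_follow and its parameters are hypothetical.

```python
from urllib.parse import urlsplit


def should_follow(start_url, target_url, target_level, outside=False, max_level=None):
    """Decide whether a link found during the crawl is queued for the next step."""
    # Assumption: maxLevel is inclusive, so a page at level == max_level is still retrieved.
    if max_level is not None and target_level > max_level:
        return False
    # With outside="true", any host is accepted.
    if outside:
        return True
    # Assumption: "outside the starting URL" means "on a different host".
    return urlsplit(target_url).hostname == urlsplit(start_url).hostname
```

With the defaults, a link from `http://example.com/` to `http://other.org/` is rejected. With `outside=True` and `max_level=10` it is accepted up to level 10 and rejected from level 11 onwards, which keeps a crawl that leaves the starting site from growing without bound.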