Crawler

The Curiosity crawler can browse a web site in a customizable and flexible manner, because it is built on the same techniques used in the screen scraper.

While crawling, given a page retrieved at a certain level, Curiosity chooses which links to follow from that page by applying XPath expressions to it.

In Curiosity, crawling consists of steps, numbered from 0. When Curiosity accesses the urlSource of a web source, it is executing step 0 of the crawl.

If you want to crawl a web site, it follows that the information you need is spread across a number of pages, and not (only) on the starting one. So you must tell Curiosity where that information is located, and (at least for now) you do this by using levels. You can say that the information must be extracted from all the pages retrieved at a specific level, or that extraction must happen at every level of the crawl.

In this case, the slots node of the web source should have a level value strictly greater than 0, as in the following example:

<webSource name="aSource">
    <nextstepsList>
        <slot level="0">
            <xpath>//a[@class = 'Links']</xpath>
        </slot>
    </nextstepsList>

    <slots level="1">
        <slot name="Title">
            ...
        </slot>
    </slots>
</webSource>

After retrieving the starting page, Curiosity looks among the children of the nextstepsList node of the web source, which is expected to contain a list of slots, each with an associated level attribute. To obtain the next-step links from the current page, only the slots whose level is 0 are chosen.

Step 0 ends with a number of URLs and, optionally, with some extracted information, in case you configured extraction to happen at every level.

Then we enter step 1, and the algorithm is repeated. In general, at level n, given the set of URLs retrieved at level n - 1, for each URL:

  1. Curiosity looks at the level value of the slots node: if it is n or -1, the extraction is performed for the URL, and the information is appended to the data gathered so far;
  2. Curiosity looks through the children of nextstepsList for slots that can be used to find more URLs from the current one.
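
The two steps above can be sketched in Python. This is only an illustration of the level logic, not Curiosity's actual implementation: the function names, the in-memory SITE dictionary, and the rule that an already seen URL is not queued again are all assumptions made for the example. Slots without a level attribute are not modelled here.

```python
# Hypothetical sketch of the level-by-level loop; not Curiosity's real code.

def crawl(start_url, fetch, rules_by_level, slots_level, extract):
    """Visit pages level by level and return the extracted data.

    rules_by_level maps a level to a list of functions, each returning the
    links selected on a page (the role of a nextstepsList slot).
    slots_level plays the role of the level attribute of the slots node.
    """
    data = []
    seen = {start_url}        # assumption: a url is never queued twice
    current = [start_url]     # urls to visit at the current level
    level = 0
    while current:
        next_urls = []
        for url in current:
            page = fetch(url)
            # 1. extract if slots targets this level, or every level (-1)
            if slots_level in (level, -1):
                data.append(extract(page))
            # 2. apply the next-step slots declared for this level
            for rule in rules_by_level.get(level, []):
                for link in rule(page):
                    if link not in seen:
                        seen.add(link)
                        next_urls.append(link)
        current = next_urls
        level += 1
    return data


# Invented in-memory "site" so the sketch runs without network access.
SITE = {
    "index": {"title": "Home", "links": ["a", "b"]},
    "a": {"title": "Page A", "links": []},
    "b": {"title": "Page B", "links": ["a"]},
}
follow_links = {0: [lambda page: page["links"]]}

titles_level_1 = crawl("index", SITE.__getitem__, follow_links, 1,
                       lambda page: page["title"])
titles_every_level = crawl("index", SITE.__getitem__, follow_links, -1,
                           lambda page: page["title"])
```

With slots_level set to 1, only the pages reached from the starting page contribute data; with -1, the starting page contributes as well.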

Of course, you can also use attribute selection and regular expressions in the slots that extract the next-step links, as in the following example:

<webSource name="aSource">
    <nextstepsList>
        <slot level="0">
            <xpath>//a[@class = 'Links']</xpath>
        </slot>
        <slot level="1">
            <xpath>//div/table/tr[4]/td/a</xpath>
            <attr>onclick</attr>
            <regexp>
                <exp>.*"(news.asp\?ID=[0-9]+)".*$</exp>
                <subst>$+</subst>
            </regexp>
        </slot>
    </nextstepsList>
    <slots level="2">
        <slot name="Title">
            ...
        </slot>
    </slots>
</webSource>

This way you can handle, to some extent, links that the browser creates by means of JavaScript code.
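
To see what the regular expression in the example above captures, here is a small Python sketch. It assumes that `$+` in the subst element stands for the captured group, as it does in Perl, and the onclick value is invented for illustration; how Curiosity itself evaluates the expression is not shown here.

```python
import re

# Pattern copied from the <exp> element of the example above.
PATTERN = re.compile(r'.*"(news.asp\?ID=[0-9]+)".*$')


def link_from_onclick(onclick):
    """Return the captured relative url, or None if the handler does not match."""
    match = PATTERN.match(onclick)
    return match.group(1) if match else None


# Hypothetical onclick attribute value, not taken from a real site.
sample = 'window.open("news.asp?ID=42", "_blank")'
captured = link_from_onclick(sample)
print(captured)
```

An onclick handler that does not contain a quoted news.asp link yields None, so no next-step URL would be produced for that element.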

If you just want to follow all the links in all the web pages of a site, you can simply use a configuration like the following:

<webSource name="aSource">
    <nextstepsList>
        <slot>
            <xpath>//a</xpath>
        </slot>
    </nextstepsList>
    <slots>
        <slot name="...">
            ...
        </slot>
    </slots>
</webSource>

Please note that in Curiosity Studio you can only set a next-step slot for level 0. If you want to specify more sophisticated rules, you must edit the curiosity.xml file by hand (the Studio will retain your changes).

By default, Curiosity will not crawl outside the starting URL; you can change this behaviour by setting the outside attribute of the nextstepsList tag to true:

<webSource name="aSource">
    <nextstepsList outside="true">
        <slot>
            <xpath>//a</xpath>
        </slot>
    </nextstepsList>
    ...
</webSource>

In this case, it can be useful to set a maximum level:

<webSource name="aSource">
    <nextstepsList maxLevel="10" outside="true">
        <slot>
            <xpath>//a</xpath>
        </slot>
    </nextstepsList>
    ...
</webSource>
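
How the outside and maxLevel attributes might combine can be sketched in Python. The exact rules are not spelled out above, so this sketch makes two assumptions: a link counts as "outside" when its host differs from the host of the starting URL, and links found at a level equal to or greater than maxLevel are not followed. The function name and the example URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def may_follow(start_url, page_url, href, level, outside=False, max_level=None):
    """Decide whether a link found on a page at `level` should be queued."""
    # Assumption: maxLevel is the deepest level the crawl may reach.
    if max_level is not None and level >= max_level:
        return False
    if outside:
        return True
    # Assumption: "outside" means a different host than the starting url.
    target = urljoin(page_url, href)
    return urlparse(target).netloc == urlparse(start_url).netloc


start = "http://example.com/"
page = "http://example.com/news/"

inside_link = may_follow(start, page, "item.html", 0)
external_link = may_follow(start, page, "http://other.example.org/x", 0)
external_allowed = may_follow(start, page, "http://other.example.org/x", 0,
                              outside=True)
too_deep = may_follow(start, page, "http://other.example.org/x", 10,
                      outside=True, max_level=10)
```

Under these assumptions, a level cap keeps a crawl with outside set to true from wandering indefinitely across other sites.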
