Settings Quick Reference

You can repeat the four steps described earlier to create a source for this page of contacts. In this case, you may also want to define an XPath that retrieves the contacts' pictures. However, not every person listed on that page has a picture, so you may notice that Curiosity assigns pictures to the wrong people. This happens because, by default, Curiosity groups the results of XPath queries by position: the first image is grouped with the first person's name, and so on.

If some slots are optional in a source, try checking the Slots: Use Pivots option in the Source Details Form. Curiosity will then try to extract a “pivot” from all the XPaths, that is, an XPath that is their longest common prefix. If the pivot exists (you may want to define your XPaths so that Curiosity can actually extract one), Curiosity first queries the page for all pivot nodes and then extracts the slots starting from those nodes. This way, missing slots are properly detected.
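The difference between the two grouping policies can be illustrated with a small Python sketch. It is not Curiosity's actual implementation: the sample page, the two XPaths, and the helper names (pivot_xpath, group_by_position, group_by_pivot) are all assumptions made for illustration, and the prefix computation naively splits the XPaths on "/", which would not cope with predicates that contain slashes.

```python
import xml.etree.ElementTree as ET

# Hypothetical contacts page: Bob has no picture.
PAGE = """
<ul>
  <li><span>Alice</span><img src="alice.png"/></li>
  <li><span>Bob</span></li>
  <li><span>Carol</span><img src="carol.png"/></li>
</ul>
""".strip()

# Hypothetical slot XPaths, one per slot.
NAME_XPATH = "./li/span"
PICTURE_XPATH = "./li/img"


def pivot_xpath(xpaths):
    """Longest common prefix of the XPaths, compared step by step."""
    prefix = []
    for steps in zip(*(x.split("/") for x in xpaths)):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return "/".join(prefix)


def group_by_position(root):
    """Position policy: the n-th name is paired with the n-th picture."""
    names = [e.text for e in root.findall(NAME_XPATH)]
    pictures = [e.get("src") for e in root.findall(PICTURE_XPATH)]
    pictures += [None] * (len(names) - len(pictures))  # pad missing slots
    return list(zip(names, pictures))


def group_by_pivot(root):
    """Pivot policy: query pivot nodes first, then each slot relative to them."""
    pivot = pivot_xpath([NAME_XPATH, PICTURE_XPATH])   # "./li"
    name_rel = "." + NAME_XPATH[len(pivot):]           # "./span"
    picture_rel = "." + PICTURE_XPATH[len(pivot):]     # "./img"
    rows = []
    for node in root.findall(pivot):
        name = node.find(name_rel)
        picture = node.find(picture_rel)
        rows.append((
            name.text if name is not None else None,
            picture.get("src") if picture is not None else None,
        ))
    return rows


root = ET.fromstring(PAGE)
by_position = group_by_position(root)
by_pivot = group_by_pivot(root)
```

With this sample page, the position policy pairs Bob with Carol's picture, while the pivot policy leaves Bob's picture slot empty and keeps Carol's picture with Carol.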

In the Source Details Form, you can also check the following options:

  • Content: XML: the target document is an XML document;
  • Notification: Full Diff: Curiosity identifies not only new items but also modified and deleted ones, and attaches the corresponding status label to the XML representation;
  • History: Keep All: see the explanation here.
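
As a rough illustration of what the Full Diff option implies, the following Python sketch compares two scraping runs by primary key and labels each item. It is only a sketch under assumptions: the function name full_diff, the dictionary representation of items, and the label strings "new", "modified" and "deleted" are hypothetical, and the real labels Curiosity writes to the XML may differ. The key defaults to TitleLink, the default primary key slot.

```python
def full_diff(previous, current, key="TitleLink"):
    """Label items of the current run against the previous run.

    Items are plain dicts here (an assumption); unchanged items are skipped.
    """
    old = {item[key]: item for item in previous}
    new = {item[key]: item for item in current}

    labelled = []
    for k, item in new.items():
        if k not in old:
            labelled.append(("new", item))
        elif item != old[k]:
            labelled.append(("modified", item))
    for k, item in old.items():
        if k not in new:
            labelled.append(("deleted", item))
    return labelled


previous_run = [
    {"TitleLink": "/a", "Title": "Alice"},
    {"TitleLink": "/b", "Title": "Bob"},
    {"TitleLink": "/d", "Title": "Dave"},
]
current_run = [
    {"TitleLink": "/a", "Title": "Alice Smith"},  # same key, changed slot
    {"TitleLink": "/c", "Title": "Carol"},        # key not seen before
    {"TitleLink": "/d", "Title": "Dave"},         # unchanged
]

changes = full_diff(previous_run, current_run)
```

Without Full Diff, only the item labelled "new" in this sketch would be reported.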

Moreover, you can assign a Named Provider from the related list.

If you specify a custom XSL file, it is used for the Mail provider and for all the assigned Named Providers.

If you specify a Next Step XPath, Curiosity uses it to fetch new pages from the starting URL; scraping is then applied only to that set of fetched pages.
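
The behaviour described above can be sketched in Python as follows. This is a hedged illustration, not Curiosity's code: the scrape function, the in-memory PAGES dictionary that stands in for HTTP fetching, and the XPaths are all assumptions. The point it shows is that the Next Step XPath is evaluated on the starting page to find the pages to fetch, and the slot XPath is evaluated only on those fetched pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical site kept in memory so the sketch runs without network access.
PAGES = {
    "start": '<div><a href="p1">1</a><a href="p2">2</a></div>',
    "p1": "<ul><li>Alice</li><li>Bob</li></ul>",
    "p2": "<ul><li>Carol</li></ul>",
}


def scrape(start_url, next_step_xpath, item_xpath, fetch):
    """Follow the next-step links found on the start page, then scrape them."""
    start = ET.fromstring(fetch(start_url))
    urls = [a.get("href") for a in start.findall(next_step_xpath)]

    items = []
    for url in urls:
        page = ET.fromstring(fetch(url))
        # The slot XPath runs on the fetched pages only, not on the start page.
        items.extend(e.text for e in page.findall(item_xpath))
    return items


items = scrape("start", "./a", "./li", PAGES.__getitem__)
```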

Of course, you gain finer control over all the Curiosity settings by editing the curiosity.xml file: for example, you can change the primary key slot (TitleLink by default), add more slots, attach attribute selection and regular expressions, define more providers with more specific settings, and define very flexible next-step (crawling) rules. See the manual for a deeper understanding of all the Curiosity facilities.

In any case, don't forget to check the options section in curiosity.xml: for example, it is useful to set a working address for your SMTP server, or to assign a suitable provider to the report collector.
