XML Data Representation

The retrieved data is represented (as it is passed to the providers) in an XML language with a table-like schema. The schema has a common part shared by all sources and a variable part that depends on the source type.

A typical XML document looks like this:

<Root>
  <InfoSet ...>
    <InfoResource .../>
    <InfoResource .../>
    ...
  </InfoSet>
</Root>

The relevant information is found in the InfoResource nodes. For example, if the data has been extracted from a news site, each InfoResource will be a news item and, depending on the Information Slots that you defined, it will have attributes for title, date, description, and so on.

The InfoSet node always has the schemaName attribute (the name of the source that originated this data) and the label attribute (if applicable).

An InfoResource always has the following attributes:

  • status, which says whether this InfoResource is New, Modified or Deleted
  • timestamp, which gives, in a human-readable format (not customizable), the time at which the information was extracted
  • timestampTicks
  • sourceUrl, which is the URL of the document from which this InfoResource was extracted (only for web sources).

The attributes of an InfoResource that carry the actual information depend on the type of the corresponding source.

As noted above, for the web source (and for the XML source) there is one attribute for every slot, and the slot name is used as the attribute name.
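
As an illustration of this layout, the following Python sketch reads a document of this shape and separates the common attributes from the slot attributes. It is only a minimal example under stated assumptions: the sample values, the slot names title and date, the timestamp format, and the helper name read_resources are all hypothetical and are not taken from the product.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. All attribute values, the slot names
# ("title", "date") and the timestamp format are placeholders.
SAMPLE = """
<Root>
  <InfoSet schemaName="ExampleNews" label="demo">
    <InfoResource status="New" timestamp="2010-01-01 10:00:00"
                  timestampTicks="123456789"
                  sourceUrl="http://example.com/news"
                  title="First item" date="2010-01-01"/>
    <InfoResource status="Modified" timestamp="2010-01-01 10:00:00"
                  timestampTicks="123456789"
                  sourceUrl="http://example.com/news"
                  title="Second item" date="2009-12-31"/>
  </InfoSet>
</Root>
"""

# Attributes that every InfoResource carries, as listed above.
COMMON = {"status", "timestamp", "timestampTicks", "sourceUrl"}


def read_resources(xml_text):
    """Return one dict per InfoResource, with common and slot attributes split."""
    root = ET.fromstring(xml_text.strip())
    items = []
    for info_set in root.findall("InfoSet"):
        for res in info_set.findall("InfoResource"):
            common = {k: v for k, v in res.attrib.items() if k in COMMON}
            # Everything that is not a common attribute is treated as a slot.
            slots = {k: v for k, v in res.attrib.items() if k not in COMMON}
            items.append({
                "source": info_set.get("schemaName"),
                "common": common,
                "slots": slots,
            })
    return items


if __name__ == "__main__":
    for item in read_resources(SAMPLE):
        print(item["source"], item["common"]["status"], item["slots"])
```

Treating every non-common attribute as a slot keeps the sketch independent of the slots defined for a particular source; a real provider might instead look up the slot names it expects.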

For the filesystem source, see the related documentation.

Note that in the case of a Combined Source, the outer InfoSet contains one InfoSet for each non-void dependency.
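
The nesting produced by a Combined Source can be handled with a recursive walk. The Python sketch below is an assumption-laden illustration: the schema names Combined, SourceA and SourceB, the title slot, and the function name walk_info_sets are hypothetical, and the sample omits the timestamp, timestampTicks and sourceUrl attributes for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical Combined Source output: an outer InfoSet holding one inner
# InfoSet per dependency. Names and values are placeholders.
COMBINED = """
<Root>
  <InfoSet schemaName="Combined">
    <InfoSet schemaName="SourceA">
      <InfoResource status="New" title="a1"/>
    </InfoSet>
    <InfoSet schemaName="SourceB">
      <InfoResource status="New" title="b1"/>
      <InfoResource status="Deleted" title="b2"/>
    </InfoSet>
  </InfoSet>
</Root>
"""


def walk_info_sets(node, path=()):
    """Yield (schema-name path, InfoResource element) pairs, depth-first."""
    for info_set in node.findall("InfoSet"):
        here = path + (info_set.get("schemaName"),)
        # Resources directly inside this InfoSet come first...
        for res in info_set.findall("InfoResource"):
            yield here, res
        # ...followed by those of any nested InfoSet.
        yield from walk_info_sets(info_set, here)


if __name__ == "__main__":
    root = ET.fromstring(COMBINED.strip())
    for path, res in walk_info_sets(root):
        print("/".join(path), res.get("status"), res.get("title"))
```

The same function also works on the non-combined layout, where it simply never recurses past the first InfoSet level.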
