org.apache.manifoldcf.crawler.interfaces
Interface IRepositoryConnector

All Superinterfaces:
IConnector
All Known Implementing Classes:
BaseRepositoryConnector

public interface IRepositoryConnector
extends IConnector

This interface describes an instance of a connection between a repository and ManifoldCF's standard "pull" ingestion agent. Each instance of this interface is used in only one thread at a time.

Connection pooling for these objects is performed by the factory that instantiates repository connectors from symbolic names and configuration parameters, and pooled connectors are keyed by those parameters. That is, a pooled connector handle is used only if all the connection parameters for the handle match.

Implementers of this interface should provide a default constructor which has this signature: xxx();

Connectors are either configured or not. If configured, they will persist in a pool, and be reused multiple times. Certain methods of a connector may be called before the connector is configured. This includes basically all methods that permit inspection of the connector's capabilities; the method details below note which methods can be called on a connector that is not connected.

The purpose of the repository connector is to allow documents to be fetched from the repository. Each repository connector describes a set of documents that are known only to that connector. It therefore establishes a space of document identifiers. Each connector will only ever be asked to deal with identifiers that have in some way originated from the connector.

Documents are fetched by ManifoldCF in three stages. First, the addSeedDocuments() method is called in the connector implementation. This method is meant to add a set of document identifiers to the queue. Second, when ManifoldCF is ready to process a document, the document identifier is used to obtain a current document version string, using the getDocumentVersions() method. This version string is used to decide whether the third stage needs to be called for the document. The third stage consists of the processDocuments() method, which is responsible for sending document content to the output and for extracting any references to additional documents.

All of these methods interact with ManifoldCF by means of an "activity" interface. For example, an IVersionActivity object is passed to the getDocumentVersions() method, and that object contains methods that are necessary for getDocumentVersions() to do its job. A similar architecture is used throughout the connector framework.
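The three stages described above can be pictured with a toy model. The sketch below does not use the ManifoldCF API: the class ThreeStageSketch, its in-memory repository map, and the crawl() method are all hypothetical stand-ins, meant only to show how comparing version strings decides whether the third stage runs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of the three stages; none of these names are ManifoldCF API.
final class ThreeStageSketch {
    // Stand-in repository: document identifier -> current version string.
    final Map<String, String> repository = new TreeMap<>();
    // Stand-in for the framework's record of the last version it processed.
    final Map<String, String> lastProcessed = new HashMap<>();

    // Runs one crawl pass and returns the identifiers that reached stage three.
    List<String> crawl() {
        // Stage 1: seeding puts document identifiers on the queue.
        Deque<String> queue = new ArrayDeque<>(repository.keySet());
        List<String> processed = new ArrayList<>();
        while (!queue.isEmpty()) {
            String id = queue.poll();
            // Stage 2: obtain the current version string for the identifier.
            String current = repository.get(id);
            String old = lastProcessed.get(id);
            // An empty version string means "no versioning ability": always process.
            boolean unchanged = !current.isEmpty() && current.equals(old);
            if (unchanged) {
                continue;
            }
            // Stage 3: process the document and remember the version that was seen.
            processed.add(id);
            lastProcessed.put(id, current);
        }
        return processed;
    }

    public static void main(String[] args) {
        ThreeStageSketch sketch = new ThreeStageSketch();
        sketch.repository.put("doc-1", "v1");
        sketch.repository.put("doc-2", "");
        System.out.println(sketch.crawl()); // first pass: both documents are new
        System.out.println(sketch.crawl()); // second pass: only the unversioned document
    }
}
```

In the real framework the queue, the stored versions, and the comparison all live on the ManifoldCF side; the connector only supplies seeds, version strings, and document content.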


Field Summary
static java.lang.String _rcsid
           
static int JOBMODE_CONTINUOUS
           
static int JOBMODE_ONCEONLY
           
static int MODEL_ADD
          Supply at least the documents that have been added since the specified start time.
static int MODEL_ADD_CHANGE
          Supply at least the documents that have been added or changed within the specified time range.
static int MODEL_ADD_CHANGE_DELETE
          Supply at least the documents that have been added, changed, or deleted within the specified time range.
static int MODEL_ALL
          Supply all seeds every time.
static int MODEL_PARTIAL
          This indicates that the seeds are never complete; the previous seeds are lost and cannot be retrieved.
 
Method Summary
 void addSeedDocuments(ISeedingActivity activities, DocumentSpecification spec, long startTime, long endTime, int jobMode)
          Queue "seed" documents.
 java.lang.String[] getActivitiesList()
          Return the list of activities that this connector supports (i.e. writes into the log).
 java.lang.String[] getBinNames(java.lang.String documentIdentifier)
          Get the bin name strings for a document identifier.
 int getConnectorModel()
          Tell the world what model this connector uses for addSeedDocuments().
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] oldVersions, IVersionActivity activities, DocumentSpecification spec, int jobMode, boolean usesDefaultAuthority)
          Get document versions given an array of document identifiers.
 int getMaxDocumentRequest()
          Get the maximum number of documents to amalgamate together into one batch, for this connector.
 java.lang.String[] getRelationshipTypes()
          Return the list of relationship types that this connector recognizes.
 void outputSpecificationBody(IHTTPOutput out, DocumentSpecification ds, java.lang.String tabName)
          Output the specification body section.
 void outputSpecificationHeader(IHTTPOutput out, DocumentSpecification ds, java.util.ArrayList tabsArray)
          Output the specification header section.
 void processDocuments(java.lang.String[] documentIdentifiers, java.lang.String[] versions, IProcessActivity activities, DocumentSpecification spec, boolean[] scanOnly, int jobMode)
          Process a set of documents.
 java.lang.String processSpecificationPost(IPostParameters variableContext, DocumentSpecification ds)
          Process a specification post.
 void releaseDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] versions)
          Free a set of documents.
 boolean requestInfo(Configuration output, java.lang.String command)
          Request arbitrary connector information.
 void viewSpecification(IHTTPOutput out, DocumentSpecification ds)
          View specification.
 
Methods inherited from interface org.apache.manifoldcf.core.interfaces.IConnector
check, clearThreadContext, connect, deinstall, disconnect, getConfiguration, install, outputConfigurationBody, outputConfigurationHeader, poll, processConfigurationPost, setThreadContext, viewConfiguration
 

Field Detail

_rcsid

static final java.lang.String _rcsid
See Also:
Constant Field Values

MODEL_ALL

static final int MODEL_ALL
Supply all seeds every time. The connector does not pay any attention to the start time or end time of the request, and simply returns a complete list of seeds.

See Also:
Constant Field Values

MODEL_ADD

static final int MODEL_ADD
Supply at least the documents that have been added since the specified start time. The connector is aware of the start time and end time of the request, and supplies at least the documents that have been added within the specified time range.

See Also:
Constant Field Values

MODEL_ADD_CHANGE

static final int MODEL_ADD_CHANGE
Supply at least the documents that have been added or changed within the specified time range.

See Also:
Constant Field Values

MODEL_ADD_CHANGE_DELETE

static final int MODEL_ADD_CHANGE_DELETE
Supply at least the documents that have been added, changed, or deleted within the specified time range.

See Also:
Constant Field Values

MODEL_PARTIAL

static final int MODEL_PARTIAL
This indicates that the seeds are never complete; the previous seeds are lost and cannot be retrieved.

See Also:
Constant Field Values

JOBMODE_ONCEONLY

static final int JOBMODE_ONCEONLY
See Also:
Constant Field Values

JOBMODE_CONTINUOUS

static final int JOBMODE_CONTINUOUS
See Also:
Constant Field Values
Method Detail

getConnectorModel

int getConnectorModel()
Tell the world what model this connector uses for addSeedDocuments(). This must return a model value as specified above. The connector does not have to be connected for this method to be called.

Returns:
the model type value.

getActivitiesList

java.lang.String[] getActivitiesList()
Return the list of activities that this connector supports (i.e. writes into the log). The connector does not have to be connected for this method to be called.

Returns:
the list.

getRelationshipTypes

java.lang.String[] getRelationshipTypes()
Return the list of relationship types that this connector recognizes. The connector does not need to be connected for this method to be called.

Returns:
the list.

getBinNames

java.lang.String[] getBinNames(java.lang.String documentIdentifier)
Get the bin name strings for a document identifier. The bin name describes the queue to which the document will be assigned for throttling purposes. Throttling controls the rate at which items in a given queue are fetched; it does not say anything about the overall fetch rate, which may operate on multiple queues or bins. For example, if you implement a web crawler, a good choice of bin name would be the server name, since that is likely to correspond to a real resource that will need real throttle protection. The connector must be connected for this method to be called.

Parameters:
documentIdentifier - is the document identifier.
Returns:
the set of bin names. If an empty array is returned, it is equivalent to there being no request rate throttling available for this identifier.
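
The web crawler case mentioned above, where the server name serves as the bin name, might look roughly like the following. This is a minimal sketch, assuming document identifiers are URLs; the class BinNameSketch and the method binNamesFor() are hypothetical names, and a real connector would place equivalent logic inside getBinNames().

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper; not part of the ManifoldCF API.
final class BinNameSketch {
    // Returns the lower-cased host as the single bin name, or an empty array
    // (no request rate throttling for this identifier) if no host can be found.
    static String[] binNamesFor(String documentIdentifier) {
        try {
            String host = new URI(documentIdentifier).getHost();
            if (host == null) {
                return new String[0];
            }
            return new String[] { host.toLowerCase(Locale.ROOT) };
        } catch (URISyntaxException e) {
            // Assumption of this sketch: unparseable identifiers are left unthrottled.
            return new String[0];
        }
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", binNamesFor("http://www.example.com/index.html")));
    }
}
```

Lower-casing the host keeps differently capitalized URLs for the same server in the same bin.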

requestInfo

boolean requestInfo(Configuration output,
                    java.lang.String command)
                    throws ManifoldCFException
Request arbitrary connector information. This method is called directly from the API in order to allow API users to perform any one of several connector-specific queries. These are usually used to create external UIs. The connector will be connected before this method is called.

Parameters:
output - is the response object, to be filled in by this method.
command - is the command, which is taken directly from the API request.
Returns:
true if the resource is found, false if not. In either case, output may be filled in.
Throws:
ManifoldCFException

addSeedDocuments

void addSeedDocuments(ISeedingActivity activities,
                      DocumentSpecification spec,
                      long startTime,
                      long endTime,
                      int jobMode)
                      throws ManifoldCFException,
                             ServiceInterruption
Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents are seeded when this method calls appropriate methods in the passed-in ISeedingActivity object.

This method can choose to find only the repository changes that happened during the specified time interval. The framework interprets the seeds recorded by this method according to what the getConnectorModel() method returns. It is always acceptable for the connector to create more seeds than are strictly necessary; the only cost is additional work overall.

The times passed to this method may be interpreted for greatest efficiency. The time ranges any given job uses with this connector will not overlap, but will proceed starting at 0 and going to the "current time", each time the job is run. For continuous crawling jobs, this method will be called once when the job starts, and then at periodic intervals as the job executes.

When a job's specification is changed, the framework automatically resets the seeding start time to 0. The seeding start time may also be set to 0 on each job run, depending on the connector model returned by getConnectorModel().

The connector will be connected before this method can be called.

Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
Throws:
ManifoldCFException
ServiceInterruption
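
The time range semantics (startTime inclusive, endTime exclusive) can be illustrated with a small filter. This is only a sketch under stated assumptions: SeedWindowSketch, selectSeeds(), and the map of last-modified times are hypothetical, and a real connector would query its repository and report each seed through the ISeedingActivity object rather than return a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of the ManifoldCF API.
final class SeedWindowSketch {
    // Selects identifiers whose last-modified time t satisfies startTime <= t < endTime.
    static List<String> selectSeeds(Map<String, Long> lastModified, long startTime, long endTime) {
        List<String> seeds = new ArrayList<>();
        // Copy into a TreeMap only so the result order is predictable.
        for (Map.Entry<String, Long> entry : new TreeMap<>(lastModified).entrySet()) {
            long modified = entry.getValue();
            if (modified >= startTime && modified < endTime) {
                seeds.add(entry.getKey());
            }
        }
        return seeds;
    }

    public static void main(String[] args) {
        Map<String, Long> modified = new TreeMap<>();
        modified.put("doc-1", 50L);
        modified.put("doc-2", 150L);
        System.out.println(selectSeeds(modified, 0L, 100L));
        System.out.println(selectSeeds(modified, 100L, 200L));
    }
}
```

Because consecutive ranges do not overlap and the end is exclusive, a document modified exactly at a range boundary is picked up by the later range only. Returning extra seeds would still be acceptable; it would just cost more work.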

getDocumentVersions

java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                       java.lang.String[] oldVersions,
                                       IVersionActivity activities,
                                       DocumentSpecification spec,
                                       int jobMode,
                                       boolean usesDefaultAuthority)
                                       throws ManifoldCFException,
                                              ServiceInterruption
Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here. The connector will be connected before this method can be called.

Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
oldVersions - is the corresponding array of version strings that have been saved for the document identifiers. A null value indicates that this is a first-time fetch, while an empty string indicates that the previous document had an empty version string.
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
ManifoldCFException
ServiceInterruption
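
One way to satisfy the points above (null for missing documents, and specification-dependent data folded into the version string) is sketched below. The names VersionStringSketch, versionsFor(), and specFingerprint are hypothetical, and the "timestamp|fingerprint" layout is an assumption of this example rather than a format ManifoldCF prescribes.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper; not part of the ManifoldCF API.
final class VersionStringSketch {
    // lastModified: stand-in repository lookup (identifier -> last-modified millis).
    // specFingerprint: stand-in summary of the specification settings that affect output.
    static String[] versionsFor(String[] documentIdentifiers,
                                Map<String, Long> lastModified,
                                String specFingerprint) {
        String[] versions = new String[documentIdentifiers.length];
        for (int i = 0; i < documentIdentifiers.length; i++) {
            Long modified = lastModified.get(documentIdentifiers[i]);
            // null marks a document that no longer exists in the repository.
            versions[i] = (modified == null) ? null : modified + "|" + specFingerprint;
        }
        return versions;
    }

    public static void main(String[] args) {
        String[] ids = { "doc-1", "doc-gone" };
        System.out.println(Arrays.toString(versionsFor(ids, Map.of("doc-1", 1000L), "meta=title")));
    }
}
```

Since the fingerprint is part of the string, changing the relevant specification settings changes every version string, which in turn causes the documents to be processed again.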

processDocuments

void processDocuments(java.lang.String[] documentIdentifiers,
                      java.lang.String[] versions,
                      IProcessActivity activities,
                      DocumentSpecification spec,
                      boolean[] scanOnly,
                      int jobMode)
                      throws ManifoldCFException,
                             ServiceInterruption
Process a set of documents. This is the method that should cause each document to be fetched and processed, with the results added to the queue of documents for the current job, entered into the incremental ingestion manager, or both. The document specification allows this class to filter what is done based on the job. The connector will be connected before this method can be called.

Parameters:
documentIdentifiers - is the set of document identifiers to process.
versions - is the corresponding document versions to process, as returned by getDocumentVersions() above. The implementation may choose to ignore this parameter and always process the current version.
activities - is the interface this method should use to queue up new document references and ingest documents.
spec - is the document specification.
scanOnly - is an array corresponding to the document identifiers. It is set to true to indicate when the processing should only find other references, and should not actually call the ingestion methods.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
Throws:
ManifoldCFException
ServiceInterruption
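
The role of the scanOnly array can be shown with a stripped-down loop. This is a sketch, not connector code: ScanOnlySketch and its two lists are hypothetical stand-ins for what a real implementation would do through the IProcessActivity object, and the rule that every document references "<id>/child" exists only to give the example something to queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; not part of the ManifoldCF API.
final class ScanOnlySketch {
    // Stand-in for document references queued for the current job.
    final List<String> queuedReferences = new ArrayList<>();
    // Stand-in for documents handed to the ingestion side.
    final List<String> ingested = new ArrayList<>();

    void process(String[] documentIdentifiers, boolean[] scanOnly) {
        for (int i = 0; i < documentIdentifiers.length; i++) {
            String id = documentIdentifiers[i];
            // References are discovered for every document, scan-only or not.
            queuedReferences.add(id + "/child");
            // Ingestion is skipped when only a scan was requested.
            if (!scanOnly[i]) {
                ingested.add(id);
            }
        }
    }

    public static void main(String[] args) {
        ScanOnlySketch sketch = new ScanOnlySketch();
        sketch.process(new String[] { "doc-1", "doc-2" }, new boolean[] { false, true });
        System.out.println(sketch.queuedReferences);
        System.out.println(sketch.ingested);
    }
}
```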

releaseDocumentVersions

void releaseDocumentVersions(java.lang.String[] documentIdentifiers,
                             java.lang.String[] versions)
                             throws ManifoldCFException
Free a set of documents. This method is called for all documents whose versions have been fetched using the getDocumentVersions() method, including those that returned null versions. It may be used to free resources committed during the getDocumentVersions() method. It is guaranteed to be called AFTER any calls to processDocuments() for the documents in question. The connector will be connected before this method can be called.

Parameters:
documentIdentifiers - is the set of document identifiers.
versions - is the corresponding set of version identifiers (individual identifiers may be null).
Throws:
ManifoldCFException
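
If getDocumentVersions() holds on to per-document resources, the release step might be organized as in the following sketch. ReleaseSketch, hold(), and release() are hypothetical names, and keeping resources in a map keyed by document identifier is an assumption of this example, not something the interface requires.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of the ManifoldCF API.
final class ReleaseSketch {
    // Resources committed while versions were being fetched, keyed by identifier.
    private final Map<String, Closeable> held = new HashMap<>();

    void hold(String documentIdentifier, Closeable resource) {
        held.put(documentIdentifier, resource);
    }

    // Closes whatever is held for each identifier and returns how many were closed.
    int release(String[] documentIdentifiers, String[] versions) {
        int closed = 0;
        for (int i = 0; i < documentIdentifiers.length; i++) {
            // versions[i] may be null (the document no longer exists); release regardless.
            Closeable resource = held.remove(documentIdentifiers[i]);
            if (resource == null) {
                continue;
            }
            try {
                resource.close();
                closed++;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return closed;
    }

    int heldCount() {
        return held.size();
    }

    public static void main(String[] args) {
        ReleaseSketch sketch = new ReleaseSketch();
        sketch.hold("doc-1", () -> { });
        System.out.println(sketch.release(new String[] { "doc-1" }, new String[] { "v1" }));
    }
}
```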

getMaxDocumentRequest

int getMaxDocumentRequest()
Get the maximum number of documents to amalgamate together into one batch, for this connector. The connector does not need to be connected for this method to be called.

Returns:
the maximum number. 0 indicates "unlimited".
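
The meaning of the return value, including 0 for "unlimited", can be illustrated by a simple partitioning routine. How the framework actually forms batches is not described here; BatchSketch and batches() are hypothetical, and treating negative values like 0 is an assumption made only to keep the sketch safe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the ManifoldCF API.
final class BatchSketch {
    // Splits ids into consecutive batches of at most maxDocumentRequest elements.
    // A value of 0 (or, by assumption, any negative value) yields a single batch.
    static List<List<String>> batches(List<String> ids, int maxDocumentRequest) {
        int size = (maxDocumentRequest <= 0) ? Math.max(ids.size(), 1) : maxDocumentRequest;
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += size) {
            int end = Math.min(start + size, ids.size());
            result.add(new ArrayList<>(ids.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("doc-1", "doc-2", "doc-3");
        System.out.println(batches(ids, 2));
        System.out.println(batches(ids, 0));
    }
}
```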

outputSpecificationHeader

void outputSpecificationHeader(IHTTPOutput out,
                               DocumentSpecification ds,
                               java.util.ArrayList tabsArray)
                               throws ManifoldCFException,
                                      java.io.IOException
Output the specification header section. This method is called in the head section of a job page which has selected a repository connection of the current type. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the job editing HTML. The connector will be connected before this method can be called.

Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws:
ManifoldCFException
java.io.IOException

outputSpecificationBody

void outputSpecificationBody(IHTTPOutput out,
                             DocumentSpecification ds,
                             java.lang.String tabName)
                             throws ManifoldCFException,
                                    java.io.IOException
Output the specification body section. This method is called in the body section of a job page which has selected a repository connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is always "editjob". The connector will be connected before this method can be called.

Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabName - is the current tab name.
Throws:
ManifoldCFException
java.io.IOException

processSpecificationPost

java.lang.String processSpecificationPost(IPostParameters variableContext,
                                          DocumentSpecification ds)
                                          throws ManifoldCFException
Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the document specification accordingly. The name of the posted form is always "editjob". The connector will be connected before this method can be called.

Parameters:
variableContext - contains the post data, including binary file-upload information.
ds - is the current document specification for this job.
Returns:
null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
Throws:
ManifoldCFException
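
The return convention (null when all is well, an error message otherwise) is sketched below with plain maps standing in for IPostParameters and DocumentSpecification. Everything here is hypothetical, including the class SpecPostSketch, the method applyPost(), and the "maxdepth" form field; it shows only the shape of the validation, not the real parameter or specification API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of the ManifoldCF API.
final class SpecPostSketch {
    // Copies a validated "maxdepth" value from the post data into the specification.
    // Returns null on success, or an error message that should prevent saving the job.
    static String applyPost(Map<String, String> postData, Map<String, String> specification) {
        String raw = postData.get("maxdepth");
        if (raw == null) {
            return null; // nothing relevant was posted; leave the specification alone
        }
        try {
            int depth = Integer.parseInt(raw.trim());
            if (depth < 0) {
                return "Maximum depth must not be negative";
            }
            specification.put("maxdepth", Integer.toString(depth));
            return null;
        } catch (NumberFormatException e) {
            return "Maximum depth must be an integer: " + raw;
        }
    }

    public static void main(String[] args) {
        Map<String, String> specification = new HashMap<>();
        System.out.println(applyPost(Map.of("maxdepth", "3"), specification));
        System.out.println(specification);
    }
}
```

The specification is modified only after validation succeeds, so a rejected post leaves the previous settings intact.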

viewSpecification

void viewSpecification(IHTTPOutput out,
                       DocumentSpecification ds)
                       throws ManifoldCFException,
                              java.io.IOException
View specification. This method is called in the body section of a job's view page. Its purpose is to present the document specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags. The connector will be connected before this method can be called.

Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
Throws:
ManifoldCFException
java.io.IOException