org.apache.manifoldcf.crawler.connectors
Class BaseRepositoryConnector

java.lang.Object
  extended by org.apache.manifoldcf.core.connector.BaseConnector
      extended by org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
All Implemented Interfaces:
IConnector, IRepositoryConnector

public abstract class BaseRepositoryConnector
extends BaseConnector
implements IRepositoryConnector

This base class describes an instance of a connection between a repository and ManifoldCF's standard "pull" ingestion agent.

Each instance of this interface is used in only one thread at a time. Connection pooling on these kinds of objects is performed by the factory which instantiates repository connectors from symbolic names and config parameters, and connectors are pooled by these parameters. That is, a pooled connector handle is used only if all the connection parameters for the handle match.

Implementers of this interface should provide a default constructor which has this signature: xxx();

Connectors are either configured or not. If configured, they will persist in a pool, and be reused multiple times. Certain methods of a connector may be called before the connector is configured. This includes basically all methods that permit inspection of the connector's capabilities. The complete list is:

The purpose of the repository connector is to allow documents to be fetched from the repository. Each repository connector describes a set of documents that are known only to that connector. It therefore establishes a space of document identifiers. Each connector will only ever be asked to deal with identifiers that have in some way originated from the connector.

Documents are fetched in three stages. First, the addSeedDocuments() method is called in the connector implementation. This queues a set of document identifiers. The document identifiers are used to obtain the current document version strings in the second stage, using the getDocumentVersions() method. The last stage is processDocuments(), which queues up any additional documents needed, and also ingests. This method will not be called if the document version seems to indicate that no document change took place.
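
The three stages described above can be pictured with a small, self-contained sketch. It is not the framework's actual crawl loop: MiniConnector, InMemoryConnector, and crawlOnce are hypothetical stand-ins invented for illustration, and the saved-version map stands in for whatever the framework persists between runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the framework's crawl loop; none of these names are ManifoldCF API.
final class ThreeStageCrawlSketch {

    /** Simplified connector with one method per stage. */
    interface MiniConnector {
        List<String> seedIdentifiers();                   // stage 1: seeding
        String currentVersion(String identifier);         // stage 2: null means the document is gone
        void process(String identifier, String version);  // stage 3: fetch and ingest
    }

    /** In-memory repository (identifier to version string), used only for the demo. */
    static final class InMemoryConnector implements MiniConnector {
        private final Map<String, String> repository;
        final List<String> processedLog = new ArrayList<>();

        InMemoryConnector(Map<String, String> repository) {
            this.repository = repository;
        }

        public List<String> seedIdentifiers() {
            return new ArrayList<>(repository.keySet());
        }

        public String currentVersion(String identifier) {
            return repository.get(identifier);
        }

        public void process(String identifier, String version) {
            processedLog.add(identifier + "@" + version);
        }
    }

    /** Runs one pass and returns the identifiers that reached stage 3. */
    static List<String> crawlOnce(MiniConnector connector, Map<String, String> savedVersions) {
        List<String> processed = new ArrayList<>();
        for (String id : connector.seedIdentifiers()) {
            String current = connector.currentVersion(id);
            if (current == null) {
                // The document no longer exists: forget it and skip processing.
                savedVersions.remove(id);
                continue;
            }
            // An empty version string means "no versioning ability", so always process.
            boolean unchanged = !current.isEmpty() && current.equals(savedVersions.get(id));
            if (unchanged) {
                continue; // stage 3 is skipped when the version indicates no change
            }
            connector.process(id, current);
            savedVersions.put(id, current);
            processed.add(id);
        }
        return processed;
    }
}
```

Calling crawlOnce twice against an unchanged repository processes a versioned document only on the first pass, while a document with an empty version string is processed on both.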


Field Summary
static java.lang.String _rcsid
           
 
Fields inherited from class org.apache.manifoldcf.core.connector.BaseConnector
currentContext, params
 
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_PARTIAL
 
Constructor Summary
BaseRepositoryConnector()
           
 
Method Summary
 void addSeedDocuments(ISeedingActivity activities, DocumentSpecification spec, long startTime, long endTime)
          Queue "seed" documents.
 void addSeedDocuments(ISeedingActivity activities, DocumentSpecification spec, long startTime, long endTime, int jobMode)
          Queue "seed" documents.
 java.lang.String[] getActivitiesList()
          Return the list of activities that this connector supports (i.e. writes into the log).
 java.lang.String[] getBinNames(java.lang.String documentIdentifier)
          Get the bin name strings for a document identifier.
 int getConnectorModel()
          Tell the world what model this connector uses for getDocumentIdentifiers().
 IDocumentIdentifierStream getDocumentIdentifiers(DocumentSpecification spec, long startTime, long endTime)
          The short version of getDocumentIdentifiers.
 IDocumentIdentifierStream getDocumentIdentifiers(ISeedingActivity activities, DocumentSpecification spec, long startTime, long endTime)
          The long version of getDocumentIdentifiers.
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, DocumentSpecification spec)
          The short version of getDocumentVersions.
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, IVersionActivity activities, DocumentSpecification spec)
          The long version of getDocumentVersions.
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] oldVersions, IVersionActivity activities, DocumentSpecification spec)
          Get document versions given an array of document identifiers.
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] oldVersions, IVersionActivity activities, DocumentSpecification spec, int jobMode)
          Get document versions given an array of document identifiers.
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] oldVersions, IVersionActivity activities, DocumentSpecification spec, int jobMode, boolean usesDefaultAuthority)
          Get document versions given an array of document identifiers.
 int getMaxDocumentRequest()
          Get the maximum number of documents to amalgamate together into one batch, for this connector.
 java.lang.String[] getRelationshipTypes()
          Return the list of relationship types that this connector recognizes.
 IDocumentIdentifierStream getRemainingDocumentIdentifiers(ISeedingActivity activities, DocumentSpecification spec, long startTime, long endTime)
          This method returns the document identifiers that should be considered part of the seeds, but do not need to be queued for processing at this time.
 void outputSpecificationBody(IHTTPOutput out, DocumentSpecification ds, java.lang.String tabName)
          Output the specification body section.
 void outputSpecificationHeader(IHTTPOutput out, DocumentSpecification ds, java.util.ArrayList tabsArray)
          Output the specification header section.
 void processDocuments(java.lang.String[] documentIdentifiers, java.lang.String[] versions, IProcessActivity activities, DocumentSpecification spec, boolean[] scanOnly)
          Process a set of documents.
 void processDocuments(java.lang.String[] documentIdentifiers, java.lang.String[] versions, IProcessActivity activities, DocumentSpecification spec, boolean[] scanOnly, int jobMode)
          Process a set of documents.
 java.lang.String processSpecificationPost(IPostParameters variableContext, DocumentSpecification ds)
          Process a specification post.
 void releaseDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] versions)
          Free a set of documents.
 boolean requestInfo(Configuration output, java.lang.String command)
          Request arbitrary connector information.
 void viewSpecification(IHTTPOutput out, DocumentSpecification ds)
          View specification.
 
Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector
check, clearThreadContext, connect, deinstall, disconnect, getConfiguration, install, outputConfigurationBody, outputConfigurationHeader, poll, processConfigurationPost, setThreadContext, viewConfiguration
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.manifoldcf.core.interfaces.IConnector
check, clearThreadContext, connect, deinstall, disconnect, getConfiguration, install, outputConfigurationBody, outputConfigurationHeader, poll, processConfigurationPost, setThreadContext, viewConfiguration
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values
Constructor Detail

BaseRepositoryConnector

public BaseRepositoryConnector()
Method Detail

getConnectorModel

public int getConnectorModel()
Tell the world what model this connector uses for getDocumentIdentifiers(). This must return a model value as specified above.

Specified by:
getConnectorModel in interface IRepositoryConnector
Returns:
the model type value.

getActivitiesList

public java.lang.String[] getActivitiesList()
Return the list of activities that this connector supports (i.e. writes into the log).

Specified by:
getActivitiesList in interface IRepositoryConnector
Returns:
the list.

getRelationshipTypes

public java.lang.String[] getRelationshipTypes()
Return the list of relationship types that this connector recognizes.

Specified by:
getRelationshipTypes in interface IRepositoryConnector
Returns:
the list.

getBinNames

public java.lang.String[] getBinNames(java.lang.String documentIdentifier)
Get the bin name strings for a document identifier. The bin name describes the queue to which the document will be assigned for throttling purposes. Throttling controls the rate at which items in a given queue are fetched; it does not say anything about the overall fetch rate, which may operate on multiple queues or bins. For example, if you implement a web crawler, a good choice of bin name would be the server name, since that is likely to correspond to a real resource that will need real throttle protection.

Specified by:
getBinNames in interface IRepositoryConnector
Parameters:
documentIdentifier - is the document identifier.
Returns:
the set of bin names. If an empty array is returned, it is equivalent to there being no request rate throttling available for this identifier.
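
As an illustration of the web-crawler example above, the following sketch derives a single bin name from the server name of a URL-shaped document identifier. It assumes identifiers are URLs, which is true only for some connectors; the class and method names are hypothetical and the sketch does not extend the real base class.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper; a real connector would put this logic in its getBinNames() override.
final class BinNameSketch {

    /** Returns the lowercase host as the only bin, or no bins when there is no usable host. */
    static String[] binNamesFor(String documentIdentifier) {
        try {
            String host = new URI(documentIdentifier).getHost();
            if (host == null) {
                // No server name available: an empty array means no throttling for this identifier.
                return new String[0];
            }
            return new String[] { host.toLowerCase(Locale.ROOT) };
        } catch (URISyntaxException e) {
            // Assumption: unparseable identifiers are left unthrottled rather than rejected.
            return new String[0];
        }
    }
}
```

Lowercasing the host keeps "Example.COM" and "example.com" in the same bin, so both are throttled as one server.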

requestInfo

public boolean requestInfo(Configuration output,
                           java.lang.String command)
                    throws ManifoldCFException
Request arbitrary connector information. This method is called directly from the API in order to allow API users to perform any one of several connector-specific queries.

Specified by:
requestInfo in interface IRepositoryConnector
Parameters:
output - is the response object, to be filled in by this method.
command - is the command, which is taken directly from the API request.
Returns:
true if the resource is found, false if not. In either case, output may be filled in.
Throws:
ManifoldCFException

addSeedDocuments

public void addSeedDocuments(ISeedingActivity activities,
                             DocumentSpecification spec,
                             long startTime,
                             long endTime,
                             int jobMode)
                      throws ManifoldCFException,
                             ServiceInterruption
Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents are seeded when this method calls appropriate methods in the passed-in ISeedingActivity object.

This method can choose to find repository changes that happen only during the specified time interval. The seeds recorded by this method will be viewed by the framework based on what the getConnectorModel() method returns. It is not a big problem if the connector chooses to create more seeds than are strictly necessary; it is merely a question of overall work required.

The times passed to this method may be interpreted for greatest efficiency. The time ranges any given job uses with this connector will not overlap, but will proceed starting at 0 and going to the "current time", each time the job is run. For continuous crawling jobs, this method will be called once, when the job starts, and at various periodic intervals as the job executes.

When a job's specification is changed, the framework automatically resets the seeding start time to 0. The seeding start time may also be set to 0 on each job run, depending on the connector model returned by getConnectorModel().

Note that it is always OK for this method to seed more documents rather than fewer.

Specified by:
addSeedDocuments in interface IRepositoryConnector
Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
Throws:
ManifoldCFException
ServiceInterruption
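
A minimal sketch of time-windowed seeding follows, treating startTime as inclusive and endTime as exclusive as documented above. The map of last-modified timestamps and the Consumer used as a seed sink are assumptions made for illustration; a real implementation would query its repository and call methods on the ISeedingActivity object instead.

```java
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper; the Consumer stands in for the seeding activity object.
final class SeedWindowSketch {

    /** Seeds every document modified in [startTime, endTime) and returns how many were seeded. */
    static int seedChangedDocuments(Map<String, Long> lastModifiedById,
                                    long startTime,
                                    long endTime,
                                    Consumer<String> seedSink) {
        int count = 0;
        for (Map.Entry<String, Long> entry : lastModifiedById.entrySet()) {
            long modified = entry.getValue();
            // startTime is inclusive and endTime is exclusive, so adjacent windows never overlap.
            if (modified >= startTime && modified < endTime) {
                seedSink.accept(entry.getKey());
                count++;
            }
        }
        return count;
    }
}
```

Because successive windows share a boundary but not a point, a document modified exactly at one window's endTime is picked up by the next window rather than twice.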

addSeedDocuments

public void addSeedDocuments(ISeedingActivity activities,
                             DocumentSpecification spec,
                             long startTime,
                             long endTime)
                      throws ManifoldCFException,
                             ServiceInterruption
Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents are seeded when this method calls appropriate methods in the passed-in ISeedingActivity object.

This method can choose to find repository changes that happen only during the specified time interval. The seeds recorded by this method will be viewed by the framework based on what the getConnectorModel() method returns. It is not a big problem if the connector chooses to create more seeds than are strictly necessary; it is merely a question of overall work required.

The times passed to this method may be interpreted for greatest efficiency. The time ranges any given job uses with this connector will not overlap, but will proceed starting at 0 and going to the "current time", each time the job is run. For continuous crawling jobs, this method will be called once, when the job starts, and at various periodic intervals as the job executes.

When a job's specification is changed, the framework automatically resets the seeding start time to 0. The seeding start time may also be set to 0 on each job run, depending on the connector model returned by getConnectorModel().

Note that it is always OK for this method to seed more documents rather than fewer.

Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
Throws:
ManifoldCFException
ServiceInterruption

getDocumentIdentifiers

public IDocumentIdentifierStream getDocumentIdentifiers(ISeedingActivity activities,
                                                        DocumentSpecification spec,
                                                        long startTime,
                                                        long endTime)
                                                 throws ManifoldCFException,
                                                        ServiceInterruption
The long version of getDocumentIdentifiers.

Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
Returns:
the local document identifiers that should be added to the queue, as a stream.
Throws:
ManifoldCFException
ServiceInterruption

getDocumentIdentifiers

public IDocumentIdentifierStream getDocumentIdentifiers(DocumentSpecification spec,
                                                        long startTime,
                                                        long endTime)
                                                 throws ManifoldCFException,
                                                        ServiceInterruption
The short version of getDocumentIdentifiers.

Parameters:
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
Returns:
the local document identifiers that should be added to the queue, as a stream.
Throws:
ManifoldCFException
ServiceInterruption

getRemainingDocumentIdentifiers

public IDocumentIdentifierStream getRemainingDocumentIdentifiers(ISeedingActivity activities,
                                                                 DocumentSpecification spec,
                                                                 long startTime,
                                                                 long endTime)
                                                          throws ManifoldCFException,
                                                                 ServiceInterruption
This method returns the document identifiers that should be considered part of the seeds, but do not need to be queued for processing at this time. This method is used to keep the hopcount tables up to date. It is allowed to return more identifiers than it strictly needs to, specifically identifiers that were also returned by the getDocumentIdentifiers() method above. However, it must constrain the identifiers it returns by the document specification. This method is only required to do anything if the connector supports hopcount determination (which it should signal by having more than zero legal relationship types returned by the getRelationshipTypes() method).

Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range that was passed to getDocumentIdentifiers().
endTime - is the end of the time range that was passed to getDocumentIdentifiers().
Returns:
the local document identifiers that should be added to the queue, as a stream, or null, if none need to be returned.
Throws:
ManifoldCFException
ServiceInterruption

getDocumentVersions

public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                              java.lang.String[] oldVersions,
                                              IVersionActivity activities,
                                              DocumentSpecification spec,
                                              int jobMode,
                                              boolean usesDefaultAuthority)
                                       throws ManifoldCFException,
                                              ServiceInterruption
Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here.

Specified by:
getDocumentVersions in interface IRepositoryConnector
Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
oldVersions - is the corresponding array of version strings that have been saved for the document identifiers. A null value indicates that this is a first-time fetch, while an empty string indicates that the previous document had an empty version string.
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
ManifoldCFException
ServiceInterruption
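
The return-value contract above (null for a missing document, and specification-dependent data folded into the version string) might be sketched as follows. The timestamp map and the metadataSetting string are hypothetical stand-ins for repository state and for whatever part of the document specification affects ingestion; the "+" separator is an arbitrary choice.

```java
import java.util.Map;

// Hypothetical helper illustrating how version strings could be composed.
final class VersionStringSketch {

    /**
     * Builds one version string per identifier. A null entry marks a document that no longer
     * exists; otherwise the string combines the last-modified time with the specification
     * setting, so a change to either one yields a different version.
     */
    static String[] versionsFor(String[] documentIdentifiers,
                                Map<String, Long> lastModifiedById,
                                String metadataSetting) {
        String[] versions = new String[documentIdentifiers.length];
        for (int i = 0; i < documentIdentifiers.length; i++) {
            Long modified = lastModifiedById.get(documentIdentifiers[i]);
            versions[i] = (modified == null) ? null : modified + "+" + metadataSetting;
        }
        return versions;
    }
}
```

Keeping the work to a map lookup and a string concatenation reflects the advice that this method runs for every document considered and should stay cheap.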

getDocumentVersions

public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                              java.lang.String[] oldVersions,
                                              IVersionActivity activities,
                                              DocumentSpecification spec,
                                              int jobMode)
                                       throws ManifoldCFException,
                                              ServiceInterruption
Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here.

Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
oldVersions - is the corresponding array of version strings that have been saved for the document identifiers. A null value indicates that this is a first-time fetch, while an empty string indicates that the previous document had an empty version string.
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
ManifoldCFException
ServiceInterruption

getDocumentVersions

public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                              java.lang.String[] oldVersions,
                                              IVersionActivity activities,
                                              DocumentSpecification spec)
                                       throws ManifoldCFException,
                                              ServiceInterruption
Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here.

Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
oldVersions - is the corresponding array of version strings that have been saved for the document identifiers. A null value indicates that this is a first-time fetch, while an empty string indicates that the previous document had an empty version string.
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
ManifoldCFException
ServiceInterruption

getDocumentVersions

public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                              IVersionActivity activities,
                                              DocumentSpecification spec)
                                       throws ManifoldCFException,
                                              ServiceInterruption
The long version of getDocumentVersions. Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here.

Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
ManifoldCFException
ServiceInterruption

getDocumentVersions

public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                              DocumentSpecification spec)
                                       throws ManifoldCFException,
                                              ServiceInterruption
The short version of getDocumentVersions. Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here.

Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
ManifoldCFException
ServiceInterruption

releaseDocumentVersions

public void releaseDocumentVersions(java.lang.String[] documentIdentifiers,
                                    java.lang.String[] versions)
                             throws ManifoldCFException
Free a set of documents. This method is called for all documents whose versions have been fetched using the getDocumentVersions() method, including those that returned null versions. It may be used to free resources committed during the getDocumentVersions() method. It is guaranteed to be called AFTER any calls to processDocuments() for the documents in question.

Specified by:
releaseDocumentVersions in interface IRepositoryConnector
Parameters:
documentIdentifiers - is the set of document identifiers.
versions - is the corresponding set of version identifiers (individual identifiers may be null).
Throws:
ManifoldCFException

getMaxDocumentRequest

public int getMaxDocumentRequest()
Get the maximum number of documents to amalgamate together into one batch, for this connector.

Specified by:
getMaxDocumentRequest in interface IRepositoryConnector
Returns:
the maximum number. 0 indicates "unlimited".
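
One way a caller could honor this limit is sketched below: identifiers are split into batches no larger than the returned maximum, with 0 meaning a single unlimited batch. The helper is hypothetical and is not how the framework is known to batch internally; treating negative values as unlimited is an extra assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing how a batch-size limit could be applied.
final class BatchSketch {

    /** Splits ids into consecutive batches; 0 (or, by assumption, a negative value) means one batch. */
    static List<List<String>> batches(List<String> ids, int maxDocumentRequest) {
        List<List<String>> result = new ArrayList<>();
        if (ids.isEmpty()) {
            return result;
        }
        int size = (maxDocumentRequest <= 0) ? ids.size() : maxDocumentRequest;
        for (int start = 0; start < ids.size(); start += size) {
            int end = Math.min(start + size, ids.size());
            // Copy the sublist so each batch is independent of the source list.
            result.add(new ArrayList<>(ids.subList(start, end)));
        }
        return result;
    }
}
```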

processDocuments

public void processDocuments(java.lang.String[] documentIdentifiers,
                             java.lang.String[] versions,
                             IProcessActivity activities,
                             DocumentSpecification spec,
                             boolean[] scanOnly,
                             int jobMode)
                      throws ManifoldCFException,
                             ServiceInterruption
Process a set of documents. This is the method that should cause each document to be fetched, processed, and the results either added to the queue of documents for the current job, and/or entered into the incremental ingestion manager. The document specification allows this class to filter what is done based on the job.

Specified by:
processDocuments in interface IRepositoryConnector
Parameters:
documentIdentifiers - is the set of document identifiers to process.
versions - is the corresponding document versions to process, as returned by getDocumentVersions() above. The implementation may choose to ignore this parameter and always process the current version.
activities - is the interface this method should use to queue up new document references and ingest documents.
spec - is the document specification.
scanOnly - is an array corresponding to the document identifiers. It is set to true to indicate when the processing should only find other references, and should not actually call the ingestion methods.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
Throws:
ManifoldCFException
ServiceInterruption
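
The role of the scanOnly array can be shown with a small sketch: references are always queued, but ingestion is skipped for entries flagged true. MiniProcessActivity and RecordingActivity are hypothetical stand-ins for the activities object, and the link map stands in for whatever the connector discovers while fetching each document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the interface below is not the real IProcessActivity.
final class ScanOnlySketch {

    interface MiniProcessActivity {
        void addReference(String childIdentifier);
        void ingest(String identifier, String version);
    }

    /** Records calls so the effect of scanOnly can be inspected. */
    static final class RecordingActivity implements MiniProcessActivity {
        final List<String> references = new ArrayList<>();
        final List<String> ingested = new ArrayList<>();

        public void addReference(String childIdentifier) {
            references.add(childIdentifier);
        }

        public void ingest(String identifier, String version) {
            ingested.add(identifier + "@" + version);
        }
    }

    static void processAll(String[] documentIdentifiers,
                           String[] versions,
                           boolean[] scanOnly,
                           Map<String, List<String>> linksById,
                           MiniProcessActivity activity) {
        for (int i = 0; i < documentIdentifiers.length; i++) {
            String id = documentIdentifiers[i];
            // References are discovered and queued regardless of the scanOnly flag.
            for (String child : linksById.getOrDefault(id, Collections.<String>emptyList())) {
                activity.addReference(child);
            }
            // Ingestion happens only when this entry is not scan-only.
            if (!scanOnly[i]) {
                activity.ingest(id, versions[i]);
            }
        }
    }
}
```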

processDocuments

public void processDocuments(java.lang.String[] documentIdentifiers,
                             java.lang.String[] versions,
                             IProcessActivity activities,
                             DocumentSpecification spec,
                             boolean[] scanOnly)
                      throws ManifoldCFException,
                             ServiceInterruption
Process a set of documents. This is the method that should cause each document to be fetched, processed, and the results either added to the queue of documents for the current job, and/or entered into the incremental ingestion manager. The document specification allows this class to filter what is done based on the job.

Parameters:
documentIdentifiers - is the set of document identifiers to process.
versions - is the corresponding document versions to process, as returned by getDocumentVersions() above. The implementation may choose to ignore this parameter and always process the current version.
activities - is the interface this method should use to queue up new document references and ingest documents.
spec - is the document specification.
scanOnly - is an array corresponding to the document identifiers. It is set to true to indicate when the processing should only find other references, and should not actually call the ingestion methods.
Throws:
ManifoldCFException
ServiceInterruption

outputSpecificationHeader

public void outputSpecificationHeader(IHTTPOutput out,
                                      DocumentSpecification ds,
                                      java.util.ArrayList tabsArray)
                               throws ManifoldCFException,
                                      java.io.IOException
Output the specification header section. This method is called in the head section of a job page which has selected a repository connection of the current type. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the job editing HTML.

Specified by:
outputSpecificationHeader in interface IRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws:
ManifoldCFException
java.io.IOException

outputSpecificationBody

public void outputSpecificationBody(IHTTPOutput out,
                                    DocumentSpecification ds,
                                    java.lang.String tabName)
                             throws ManifoldCFException,
                                    java.io.IOException
Output the specification body section. This method is called in the body section of a job page which has selected a repository connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editjob".

Specified by:
outputSpecificationBody in interface IRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabName - is the current tab name.
Throws:
ManifoldCFException
java.io.IOException

processSpecificationPost

public java.lang.String processSpecificationPost(IPostParameters variableContext,
                                                 DocumentSpecification ds)
                                          throws ManifoldCFException
Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the document specification accordingly. The name of the posted form is "editjob".

Specified by:
processSpecificationPost in interface IRepositoryConnector
Parameters:
variableContext - contains the post data, including binary file-upload information.
ds - is the current document specification for this job.
Returns:
null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
Throws:
ManifoldCFException
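
The return convention (null when all is well, an error message otherwise) could look like the sketch below. Plain maps stand in for IPostParameters and DocumentSpecification, and the "maxdepth" form field is a made-up example, not a field any real connector is known to use.

```java
import java.util.Map;

// Hypothetical sketch; maps replace the real post-parameter and specification types.
final class SpecPostSketch {

    /** Applies a posted "maxdepth" value to the spec; returns null on success or an error message. */
    static String applyPost(Map<String, String> posted, Map<String, String> spec) {
        String raw = posted.get("maxdepth");
        if (raw == null) {
            // Nothing was posted for this field, so the specification is left untouched.
            return null;
        }
        try {
            int depth = Integer.parseInt(raw.trim());
            if (depth < 0) {
                return "Maximum depth must not be negative";
            }
            spec.put("maxdepth", Integer.toString(depth));
            return null;
        } catch (NumberFormatException e) {
            // A non-null return value signals that the job should not be saved.
            return "Maximum depth must be an integer: " + raw;
        }
    }
}
```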

viewSpecification

public void viewSpecification(IHTTPOutput out,
                              DocumentSpecification ds)
                       throws ManifoldCFException,
                              java.io.IOException
View specification. This method is called in the body section of a job's view page. Its purpose is to present the document specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.

Specified by:
viewSpecification in interface IRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
Throws:
ManifoldCFException
java.io.IOException