org.apache.manifoldcf.agents.interfaces
Interface IIncrementalIngester

All Known Implementing Classes:
IncrementalIngester

public interface IIncrementalIngester

This interface describes the incremental ingestion API.

The expected client flow for this API is:
1) Use the API to fetch a document's version.
2) Decide whether to ingest based on that version.
3) If the decision is to ingest, call the ingest method in the API.

The module described by this interface is responsible for keeping track of what has been sent where, and also the corresponding version of each document so indexed. The space over which this takes place is defined by the individual output connection; that is, the output connection appears to "remember" which documents were handed to it.

A secondary purpose of this module is to provide a mapping between the key by which a document is described internally (an identifier hash, plus the name of an identifier space) and the way the document is identified in the output space (the name of an output connection, plus a URI that is considered local to that output connection space).
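
The client flow described above can be sketched with a small in-memory stand-in. This is only an illustration under stated assumptions: the class ClientFlowSketch and its methods lookupVersion, ingest and process are hypothetical names, not part of ManifoldCF, and the real module persists its state rather than holding it in a map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the tracking this interface describes.
// It is NOT the ManifoldCF implementation; every name here is illustrative.
class ClientFlowSketch {
    // "outputConnectionName|identifierClass|identifierHash" -> last ingested version
    private final Map<String, String> versions = new HashMap<>();
    private int ingestCount = 0;

    private static String key(String conn, String idClass, String idHash) {
        return conn + "|" + idClass + "|" + idHash;
    }

    // Step 1: fetch the version recorded for a document, or null if none.
    String lookupVersion(String conn, String idClass, String idHash) {
        return versions.get(key(conn, idClass, idHash));
    }

    // Step 3: "ingest" the document and remember its version.
    void ingest(String conn, String idClass, String idHash, String version) {
        versions.put(key(conn, idClass, idHash), version);
        ingestCount++;
    }

    // Steps 1 to 3 combined; returns true if the document was ingested.
    boolean process(String conn, String idClass, String idHash, String currentVersion) {
        String recorded = lookupVersion(conn, idClass, idHash);
        // Step 2: decide based on the recorded version.
        if (currentVersion.equals(recorded)) {
            return false; // versions agree, nothing to send
        }
        ingest(conn, idClass, idHash, currentVersion);
        return true;
    }

    int getIngestCount() {
        return ingestCount;
    }

    public static void main(String[] args) {
        ClientFlowSketch tracker = new ClientFlowSketch();
        System.out.println(tracker.process("conn1", "space1", "abc123", "v1")); // prints true
        System.out.println(tracker.process("conn1", "space1", "abc123", "v1")); // prints false
        System.out.println(tracker.process("conn1", "space1", "abc123", "v2")); // prints true
    }
}
```

Keying the map by connection name as well as identifier mirrors the point that the tracked state is scoped to an individual output connection.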


Field Summary
static java.lang.String _rcsid
           
 
Method Summary
 boolean checkDocumentIndexable(java.lang.String outputConnectionName, java.io.File localFile)
          Check if a file is indexable.
 boolean checkMimeTypeIndexable(java.lang.String outputConnectionName, java.lang.String mimeType)
          Check if a mime type is indexable.
 void clearAll()
          Flush all knowledge of what was ingested before.
 void deinstall()
          Uninstall the incremental ingestion manager.
 void documentCheck(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash, long checkTime)
          Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).
 void documentCheckMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, long checkTime)
          Note the fact that we checked a set of documents (and found that they did not need to be ingested, because the versions agreed).
 void documentDelete(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash, IOutputRemoveActivity activities)
          Delete a document from the search engine index.
 void documentDeleteMultiple(java.lang.String[] outputConnectionNames, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities)
          Delete multiple documents from the search engine index.
 void documentDeleteMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities)
          Delete multiple documents from the search engine index.
 boolean documentIngest(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String documentVersion, java.lang.String outputVersion, java.lang.String authorityName, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities)
          Ingest a document.
 void documentRecord(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String documentVersion, long recordTime, IOutputActivity activities)
          Record a document version, but don't ingest it.
 DocumentIngestStatus getDocumentIngestData(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash)
          Look up ingestion data for a document.
 DocumentIngestStatus[] getDocumentIngestDataMultiple(java.lang.String[] outputConnectionNames, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
          Look up ingestion data for a SET of documents.
 DocumentIngestStatus[] getDocumentIngestDataMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
          Look up ingestion data for a SET of documents.
 long getDocumentUpdateInterval(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash)
          Calculate the average time interval between changes for a document.
 long[] getDocumentUpdateIntervalMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes)
          Calculate the average time interval between changes for each of a set of documents.
 void install()
          Install the incremental ingestion manager.
 void resetOutputConnection(java.lang.String outputConnectionName)
          Reset all documents belonging to a specific output connection, because we've got information that that system has been reconfigured.
 

Field Detail

_rcsid

static final java.lang.String _rcsid
See Also:
Constant Field Values
Method Detail

install

void install()
             throws ManifoldCFException
Install the incremental ingestion manager.

Throws:
ManifoldCFException

deinstall

void deinstall()
               throws ManifoldCFException
Uninstall the incremental ingestion manager.

Throws:
ManifoldCFException

clearAll

void clearAll()
              throws ManifoldCFException
Flush all knowledge of what was ingested before.

Throws:
ManifoldCFException

checkMimeTypeIndexable

boolean checkMimeTypeIndexable(java.lang.String outputConnectionName,
                               java.lang.String mimeType)
                               throws ManifoldCFException,
                                      ServiceInterruption
Check if a mime type is indexable.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
mimeType - is the mime type to check.
Returns:
true if the mimeType is indexable.
Throws:
ManifoldCFException
ServiceInterruption

checkDocumentIndexable

boolean checkDocumentIndexable(java.lang.String outputConnectionName,
                               java.io.File localFile)
                               throws ManifoldCFException,
                                      ServiceInterruption
Check if a file is indexable.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
localFile - is the local file to check.
Returns:
true if the local file is indexable.
Throws:
ManifoldCFException
ServiceInterruption

documentRecord

void documentRecord(java.lang.String outputConnectionName,
                    java.lang.String identifierClass,
                    java.lang.String identifierHash,
                    java.lang.String documentVersion,
                    long recordTime,
                    IOutputActivity activities)
                    throws ManifoldCFException,
                           ServiceInterruption
Record a document version, but don't ingest it. The purpose of this method is to keep track of the frequency at which ingestion "attempts" take place. ServiceInterruption is thrown if this action must be rescheduled.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
documentVersion - is the document version.
recordTime - is the time at which the recording took place, in milliseconds since epoch.
activities - is the object used in case a document needs to be removed from the output index as the result of this operation.
Throws:
ManifoldCFException
ServiceInterruption

documentIngest

boolean documentIngest(java.lang.String outputConnectionName,
                       java.lang.String identifierClass,
                       java.lang.String identifierHash,
                       java.lang.String documentVersion,
                       java.lang.String outputVersion,
                       java.lang.String authorityName,
                       RepositoryDocument data,
                       long ingestTime,
                       java.lang.String documentURI,
                       IOutputActivity activities)
                       throws ManifoldCFException,
                              ServiceInterruption
Ingest a document. This ingests the document, and notes it. If this is a repeat ingestion of the document, this method also REMOVES ALL OLD METADATA. When complete, the index will contain only the metadata described by the RepositoryDocument object passed to this method. ServiceInterruption is thrown if the document ingestion must be rescheduled.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
documentVersion - is the document version.
outputVersion - is the output version string constructed from the output specification by the output connector.
authorityName - is the name of the authority associated with the document, if any.
data - is the document data. The data is closed after ingestion is complete.
ingestTime - is the time at which the ingestion took place, in milliseconds since epoch.
documentURI - is the URI of the document, which will be used as the key of the document in the index.
activities - is an object providing a set of methods that the implementer can use to perform the operation.
Returns:
true if the ingest was ok, false if the ingest is illegal (and should not be repeated).
Throws:
ManifoldCFException
ServiceInterruption
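
The replace-on-repeat behavior described for documentIngest can be illustrated with a plain map. This is a minimal sketch, assuming an index keyed by document URI; ReplaceOnReingestSketch and its methods are hypothetical and do not use RepositoryDocument or any other ManifoldCF class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of "repeat ingestion removes all old metadata".
class ReplaceOnReingestSketch {
    // documentURI -> metadata currently held in the (pretend) index
    private final Map<String, Map<String, String>> index = new HashMap<>();

    void ingest(String documentURI, Map<String, String> metadata) {
        // Storing a fresh copy replaces the previous entry outright, so
        // fields from an earlier ingestion do not survive (no merging).
        index.put(documentURI, new HashMap<>(metadata));
    }

    Map<String, String> metadataFor(String documentURI) {
        return index.get(documentURI);
    }

    public static void main(String[] args) {
        ReplaceOnReingestSketch sketch = new ReplaceOnReingestSketch();
        String uri = "http://example.com/doc1";

        Map<String, String> first = new HashMap<>();
        first.put("title", "Draft");
        first.put("author", "alice");
        sketch.ingest(uri, first);

        Map<String, String> second = new HashMap<>();
        second.put("title", "Final");
        sketch.ingest(uri, second);

        // The "author" field from the first ingestion is gone.
        System.out.println(sketch.metadataFor(uri)); // prints {title=Final}
    }
}
```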

documentCheckMultiple

void documentCheckMultiple(java.lang.String outputConnectionName,
                           java.lang.String[] identifierClasses,
                           java.lang.String[] identifierHashes,
                           long checkTime)
                           throws ManifoldCFException
Note the fact that we checked a set of documents (and found that they did not need to be ingested, because the versions agreed).

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the set of document identifier hashes.
checkTime - is the time at which the check took place, in milliseconds since epoch.
Throws:
ManifoldCFException

documentCheck

void documentCheck(java.lang.String outputConnectionName,
                   java.lang.String identifierClass,
                   java.lang.String identifierHash,
                   long checkTime)
                   throws ManifoldCFException
Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed).

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
checkTime - is the time at which the check took place, in milliseconds since epoch.
Throws:
ManifoldCFException

documentDeleteMultiple

void documentDeleteMultiple(java.lang.String[] outputConnectionNames,
                            java.lang.String[] identifierClasses,
                            java.lang.String[] identifierHashes,
                            IOutputRemoveActivity activities)
                            throws ManifoldCFException,
                                   ServiceInterruption
Delete multiple documents from the search engine index.

Parameters:
outputConnectionNames - are the names of the output connections associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes of the documents.
activities - is the object to use to log the details of the deletion attempt. May be null.
Throws:
ManifoldCFException
ServiceInterruption

documentDeleteMultiple

void documentDeleteMultiple(java.lang.String outputConnectionName,
                            java.lang.String[] identifierClasses,
                            java.lang.String[] identifierHashes,
                            IOutputRemoveActivity activities)
                            throws ManifoldCFException,
                                   ServiceInterruption
Delete multiple documents from the search engine index.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes of the documents.
activities - is the object to use to log the details of the deletion attempt. May be null.
Throws:
ManifoldCFException
ServiceInterruption

documentDelete

void documentDelete(java.lang.String outputConnectionName,
                    java.lang.String identifierClass,
                    java.lang.String identifierHash,
                    IOutputRemoveActivity activities)
                    throws ManifoldCFException,
                           ServiceInterruption
Delete a document from the search engine index.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
activities - is the object to use to log the details of the deletion attempt. May be null.
Throws:
ManifoldCFException
ServiceInterruption

getDocumentIngestDataMultiple

DocumentIngestStatus[] getDocumentIngestDataMultiple(java.lang.String[] outputConnectionNames,
                                                     java.lang.String[] identifierClasses,
                                                     java.lang.String[] identifierHashes)
                                                     throws ManifoldCFException
Look up ingestion data for a SET of documents.

Parameters:
outputConnectionNames - are the names of the output connections associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes to look up.
Returns:
the array of document data. Null will come back for any identifier that doesn't exist in the index.
Throws:
ManifoldCFException

getDocumentIngestDataMultiple

DocumentIngestStatus[] getDocumentIngestDataMultiple(java.lang.String outputConnectionName,
                                                     java.lang.String[] identifierClasses,
                                                     java.lang.String[] identifierHashes)
                                                     throws ManifoldCFException
Look up ingestion data for a SET of documents.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes to look up.
Returns:
the array of document data. Null will come back for any identifier that doesn't exist in the index.
Throws:
ManifoldCFException

getDocumentIngestData

DocumentIngestStatus getDocumentIngestData(java.lang.String outputConnectionName,
                                           java.lang.String identifierClass,
                                           java.lang.String identifierHash)
                                           throws ManifoldCFException
Look up ingestion data for a document.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
Returns:
the current document's ingestion data, or null if the document is not currently ingested.
Throws:
ManifoldCFException

getDocumentUpdateIntervalMultiple

long[] getDocumentUpdateIntervalMultiple(java.lang.String outputConnectionName,
                                         java.lang.String[] identifierClasses,
                                         java.lang.String[] identifierHashes)
                                         throws ManifoldCFException
Calculate the average time interval between changes for each of a set of documents. This is based on the data gathered for each document.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - are the hashes of the ids of the documents.
Returns:
an array holding, for each document, the number of milliseconds between changes, or 0 if this cannot be calculated.
Throws:
ManifoldCFException

getDocumentUpdateInterval

long getDocumentUpdateInterval(java.lang.String outputConnectionName,
                               java.lang.String identifierClass,
                               java.lang.String identifierHash)
                               throws ManifoldCFException
Calculate the average time interval between changes for a document. This is based on the data gathered for the document.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
Returns:
the number of milliseconds between changes, or 0 if this cannot be calculated.
Throws:
ManifoldCFException
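
One plausible way to compute such an average is shown below. This page does not say which data the implementation gathers or how it averages it, so this is an assumption: the sketch takes a sorted array of change timestamps and divides the elapsed time by the number of gaps. UpdateIntervalSketch and averageInterval are hypothetical names.

```java
// Hypothetical sketch: average gap between recorded change times.
class UpdateIntervalSketch {
    // changeTimes: timestamps in milliseconds since epoch, sorted ascending.
    // Returns 0 when fewer than two changes are known, matching the
    // "0 if this cannot be calculated" convention.
    static long averageInterval(long[] changeTimes) {
        if (changeTimes == null || changeTimes.length < 2) {
            return 0L;
        }
        long first = changeTimes[0];
        long last = changeTimes[changeTimes.length - 1];
        // n timestamps delimit n - 1 intervals.
        return (last - first) / (changeTimes.length - 1);
    }

    public static void main(String[] args) {
        System.out.println(averageInterval(new long[] {1000L, 4000L, 7000L})); // prints 3000
        System.out.println(averageInterval(new long[] {1000L}));               // prints 0
    }
}
```

Because 0 doubles as the "unknown" marker, a caller cannot distinguish it from a genuinely zero interval and should treat it as "no estimate available".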

resetOutputConnection

void resetOutputConnection(java.lang.String outputConnectionName)
                           throws ManifoldCFException
Reset all documents belonging to a specific output connection, because we've got information that that system has been reconfigured. This will force all such documents to be reindexed the next time they are checked.

Parameters:
outputConnectionName - is the name of the output connection associated with this action.
Throws:
ManifoldCFException
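
The effect of a reset on later version checks can be sketched as follows. This is an illustration only: ResetConnectionSketch is hypothetical, and forgetting the recorded versions is just the simplest way to make every later check fail to match; this page does not say how the implementation achieves this.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resetting one connection forces its documents to be
// treated as changed on the next check, leaving other connections untouched.
class ResetConnectionSketch {
    // outputConnectionName -> (document key -> recorded version)
    private final Map<String, Map<String, String>> byConnection = new HashMap<>();

    void record(String conn, String docKey, String version) {
        byConnection.computeIfAbsent(conn, k -> new HashMap<>()).put(docKey, version);
    }

    // True if the current version differs from what was recorded (or nothing was).
    boolean needsIndexing(String conn, String docKey, String currentVersion) {
        Map<String, String> docs = byConnection.get(conn);
        String recorded = (docs == null) ? null : docs.get(docKey);
        return !currentVersion.equals(recorded);
    }

    // Forget every recorded version for this connection only.
    void reset(String conn) {
        Map<String, String> docs = byConnection.get(conn);
        if (docs != null) {
            docs.clear();
        }
    }

    public static void main(String[] args) {
        ResetConnectionSketch sketch = new ResetConnectionSketch();
        sketch.record("connA", "doc1", "v1");
        sketch.record("connB", "doc1", "v1");

        sketch.reset("connA");

        System.out.println(sketch.needsIndexing("connA", "doc1", "v1")); // prints true
        System.out.println(sketch.needsIndexing("connB", "doc1", "v1")); // prints false
    }
}
```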