java.lang.Object
org.apache.manifoldcf.core.database.BaseTable
org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester
public class IncrementalIngester
extends BaseTable
implements IIncrementalIngester
Incremental ingestion API implementation. This class is responsible for keeping track of what has been sent where, and of the corresponding version of each document so indexed. The scope of this tracking is defined by the individual output connection; that is, each output connection appears to "remember" which documents were handed to it. A secondary purpose of this module is to provide a mapping between the key by which a document is described internally (an identifier hash plus the name of an identifier space) and the way the document is identified in the output space (the name of an output connection plus a URI that is considered local to that output connection space).
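The two-part mapping described above can be pictured with a small in-memory model. This is only an illustrative sketch: `IngestLedger`, `Entry`, `record`, and `lookup` are hypothetical names, and the real class stores this state in a database table rather than in maps.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the ingest-status table; not the ManifoldCF implementation. */
class IngestLedger {
    /** What was last sent for one document on one output connection. */
    static final class Entry {
        final String documentUri;
        final String documentVersion;

        Entry(String documentUri, String documentVersion) {
            this.documentUri = documentUri;
            this.documentVersion = documentVersion;
        }
    }

    // Outer key: output connection name. Inner key: internal document key.
    private final Map<String, Map<String, Entry>> byConnection = new HashMap<>();

    /** Remember that docKey was sent to connectionName under the given URI and version. */
    void record(String connectionName, String docKey, String documentUri, String documentVersion) {
        byConnection.computeIfAbsent(connectionName, k -> new HashMap<>())
            .put(docKey, new Entry(documentUri, documentVersion));
    }

    /** Returns the entry, or null if this connection was never handed the document. */
    Entry lookup(String connectionName, String docKey) {
        Map<String, Entry> docs = byConnection.get(connectionName);
        return docs == null ? null : docs.get(docKey);
    }
}
```

Because the outer key is the output connection name, the same document can be known to one connection and unknown to another, which matches the "per output connection" scope described above.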
| Nested Class Summary | |
|---|---|
| protected static class | IncrementalIngester.DeleteInfo: This class contains the information necessary to delete a document. |
| Field Summary | |
|---|---|
| static java.lang.String | _rcsid |
| protected static java.lang.String | authorityNameField |
| protected static java.lang.String | changeCountField |
| protected IOutputConnectionManager | connectionManager |
| protected static java.lang.String | docKeyField |
| protected static java.lang.String | docURIField |
| protected static java.lang.String | firstIngestField |
| protected static java.lang.String | idField |
| protected static java.lang.String | lastIngestField |
| protected static java.lang.String | lastOutputVersionField |
| protected static java.lang.String | lastVersionField |
| protected ILockManager | lockManager |
| protected static java.lang.String | outputConnNameField |
| protected IThreadContext | threadContext |
| protected static java.lang.String | uriHashField |
| Fields inherited from class org.apache.manifoldcf.core.database.BaseTable |
|---|
dbInterface, tableName |
| Constructor Summary | |
|---|---|
| IncrementalIngester(IThreadContext threadContext, IDBInterface database) | Constructor. |
| Method Summary | |
|---|---|
| protected int | addOrReplaceDocument(IOutputConnection connection, java.lang.String documentURI, java.lang.String outputDescription, RepositoryDocument document, java.lang.String authorityNameString, IOutputAddActivity activities): Add or replace document, using the specified output connection, via the standard pool. |
| boolean | checkDocumentIndexable(java.lang.String outputConnectionName, java.io.File localFile): Check if a file is indexable. |
| boolean | checkMimeTypeIndexable(java.lang.String outputConnectionName, java.lang.String mimeType): Check if a mime type is indexable. |
| void | clearAll(): Flush all knowledge of what was ingested before. |
| void | deinstall(): Uninstall the incremental ingestion manager. |
| protected void | deleteRowIds(java.util.ArrayList list, java.lang.String queryPart): Delete a chunk of row ids. |
| void | documentCheck(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash, long checkTime): Note the fact that we checked a document (and found that it did not need to be ingested, because the versions agreed). |
| void | documentCheckMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, long checkTime): Note the fact that we checked a set of documents (and found that they did not need to be ingested, because the versions agreed). |
| void | documentDelete(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash, IOutputRemoveActivity activities): Delete a document from the search engine index. |
| void | documentDeleteMultiple(java.lang.String[] outputConnectionNames, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities): Delete multiple documents from the search engine index. |
| void | documentDeleteMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes, IOutputRemoveActivity activities): Delete multiple documents from the search engine index. |
| boolean | documentIngest(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String documentVersion, java.lang.String outputVersion, java.lang.String authorityName, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities): Ingest a document. |
| void | documentRecord(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash, java.lang.String documentVersion, long recordTime, IOutputActivity activities): Record a document version, but don't ingest it. |
| protected void | findRowIdsForDocIds(java.lang.String outputConnectionName, java.util.HashMap rowIDSet, java.util.ArrayList paramValues, java.lang.String paramList): Given values and parameters corresponding to a set of hash values, add corresponding table row ids to the output map. |
| protected void | findRowIdsForURIs(java.lang.String outputConnectionName, java.util.HashMap rowIDSet, java.util.HashMap uris, java.util.ArrayList hashParamValues, java.lang.String paramList): Given values and parameters corresponding to a set of hash values, add corresponding table row ids to the output map. |
| DocumentIngestStatus | getDocumentIngestData(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash): Look up ingestion data for a document. |
| protected void | getDocumentIngestDataChunk(DocumentIngestStatus[] rval, java.util.Map map, java.lang.String outputConnectionName, java.lang.String clause, java.util.ArrayList list): Get a chunk of document ingest data records. |
| DocumentIngestStatus[] | getDocumentIngestDataMultiple(java.lang.String[] outputConnectionNames, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes): Look up ingestion data for a SET of documents. |
| DocumentIngestStatus[] | getDocumentIngestDataMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes): Look up ingestion data for a SET of documents. |
| long | getDocumentUpdateInterval(java.lang.String outputConnectionName, java.lang.String identifierClass, java.lang.String identifierHash): Calculate the average time interval between changes for a document. |
| long[] | getDocumentUpdateIntervalMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes): Calculate the average time interval between changes for each of a set of documents. |
| protected void | getDocumentURIChunk(IncrementalIngester.DeleteInfo[] rval, java.util.Map map, java.lang.String outputConnectionName, java.lang.String clause, java.util.ArrayList list): Get a chunk of document URIs. |
| protected IncrementalIngester.DeleteInfo[] | getDocumentURIMultiple(java.lang.String outputConnectionName, java.lang.String[] identifierClasses, java.lang.String[] identifierHashes): Find out under what URIs a SET of documents is currently ingested. |
| protected void | getIntervals(long[] rval, java.lang.String outputConnectionName, java.util.ArrayList list, java.lang.String queryPart, java.util.HashMap returnMap): Query for and calculate the interval for a bunch of hashcodes. |
| void | install(): Install the incremental ingestion manager. |
| protected static java.lang.String | makeKey(java.lang.String documentClass, java.lang.String documentHash): Make a key from a document class and a hash. |
| protected void | noteDocumentIngest(java.lang.String outputConnectionName, java.lang.String docKey, java.lang.String documentVersion, java.lang.String outputVersion, java.lang.String authorityNameString, long ingestTime, java.lang.String documentURI, java.lang.String documentURIHash): Note the ingestion of a document, or the "update" of a document. |
| protected boolean | performIngestion(IOutputConnection connection, java.lang.String docKey, java.lang.String documentVersion, java.lang.String outputVersion, java.lang.String authorityNameString, RepositoryDocument data, long ingestTime, java.lang.String documentURI, IOutputActivity activities): Do the actual ingestion, or just record it if there's nothing to ingest. |
| protected void | removeDocument(IOutputConnection connection, java.lang.String documentURI, java.lang.String outputDescription, IOutputRemoveActivity activities): Remove document, using the specified output connection, via the standard pool. |
| void | resetOutputConnection(java.lang.String outputConnectionName): Reset all documents belonging to a specific output connection, because we've got information that that system has been reconfigured. |
| protected void | updateRowIds(java.util.ArrayList list, java.lang.String queryPart, long checkTime): Update a chunk of row ids. |
| Methods inherited from class org.apache.manifoldcf.core.database.BaseTable |
|---|
addTableIndex, analyzeTable, beginTransaction, constructDistinctOnClause, constructOffsetLimitClause, constructRegexpClause, constructSubstringClause, endTransaction, getDatabaseCacheKey, getDBInterface, getMaxInClause, getMaxOrClause, getTableIndexes, getTableName, getTableSchema, getTransactionID, makeTableKey, noteModifications, performAddIndex, performAlter, performCreate, performDelete, performDrop, performInsert, performLock, performModification, performQuery, performQuery, performRemoveIndex, performUpdate, prepareRowForSave, readRow, reindexTable, signalRollback |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final java.lang.String _rcsid
protected static final java.lang.String idField
protected static final java.lang.String outputConnNameField
protected static final java.lang.String docKeyField
protected static final java.lang.String docURIField
protected static final java.lang.String uriHashField
protected static final java.lang.String lastVersionField
protected static final java.lang.String lastOutputVersionField
protected static final java.lang.String changeCountField
protected static final java.lang.String firstIngestField
protected static final java.lang.String lastIngestField
protected static final java.lang.String authorityNameField
protected IThreadContext threadContext
protected ILockManager lockManager
protected IOutputConnectionManager connectionManager
| Constructor Detail |
|---|
public IncrementalIngester(IThreadContext threadContext,
IDBInterface database)
throws ManifoldCFException
ManifoldCFException
| Method Detail |
|---|
public void install()
throws ManifoldCFException
install in interface IIncrementalIngester
ManifoldCFException
public void deinstall()
throws ManifoldCFException
deinstall in interface IIncrementalIngester
ManifoldCFException
public void clearAll()
throws ManifoldCFException
clearAll in interface IIncrementalIngester
ManifoldCFException
public boolean checkMimeTypeIndexable(java.lang.String outputConnectionName,
java.lang.String mimeType)
throws ManifoldCFException,
ServiceInterruption
checkMimeTypeIndexable in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
mimeType - is the mime type to check.
ManifoldCFException
ServiceInterruption
public boolean checkDocumentIndexable(java.lang.String outputConnectionName,
java.io.File localFile)
throws ManifoldCFException,
ServiceInterruption
checkDocumentIndexable in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
localFile - is the local file to check.
ManifoldCFException
ServiceInterruption
public void documentRecord(java.lang.String outputConnectionName,
java.lang.String identifierClass,
java.lang.String identifierHash,
java.lang.String documentVersion,
long recordTime,
IOutputActivity activities)
throws ManifoldCFException,
ServiceInterruption
documentRecord in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
documentVersion - is the document version.
recordTime - is the time at which the recording took place, in milliseconds since epoch.
activities - is the object used in case a document needs to be removed from the output index as the result of this operation.
ManifoldCFException
ServiceInterruption
public boolean documentIngest(java.lang.String outputConnectionName,
java.lang.String identifierClass,
java.lang.String identifierHash,
java.lang.String documentVersion,
java.lang.String outputVersion,
java.lang.String authorityName,
RepositoryDocument data,
long ingestTime,
java.lang.String documentURI,
IOutputActivity activities)
throws ManifoldCFException,
ServiceInterruption
documentIngest in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
documentVersion - is the document version.
outputVersion - is the output version string constructed from the output specification by the output connector.
authorityName - is the name of the authority associated with the document, if any.
data - is the document data. The data is closed after ingestion is complete.
ingestTime - is the time at which the ingestion took place, in milliseconds since epoch.
documentURI - is the URI of the document, which will be used as the key of the document in the index.
activities - is an object providing a set of methods that the implementer can use to perform the operation.
ManifoldCFException
ServiceInterruption
protected boolean performIngestion(IOutputConnection connection,
java.lang.String docKey,
java.lang.String documentVersion,
java.lang.String outputVersion,
java.lang.String authorityNameString,
RepositoryDocument data,
long ingestTime,
java.lang.String documentURI,
IOutputActivity activities)
throws ManifoldCFException,
ServiceInterruption
ManifoldCFException
ServiceInterruption
public void documentCheckMultiple(java.lang.String outputConnectionName,
java.lang.String[] identifierClasses,
java.lang.String[] identifierHashes,
long checkTime)
throws ManifoldCFException
documentCheckMultiple in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the set of document identifier hashes.
checkTime - is the time at which the check took place, in milliseconds since epoch.
ManifoldCFException
public void documentCheck(java.lang.String outputConnectionName,
java.lang.String identifierClass,
java.lang.String identifierHash,
long checkTime)
throws ManifoldCFException
documentCheck in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hashed document identifier.
checkTime - is the time at which the check took place, in milliseconds since epoch.
ManifoldCFException
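The "versions agreed" condition that leads a caller to documentCheck rather than documentIngest can be sketched as a simple comparison. This is an assumption about the decision, not code taken from this class: `VersionCheck` and `needsIngest` are hypothetical names, and the real logic may consider more state.

```java
import java.util.Objects;

/** Hypothetical helper; the names and logic are assumptions, not the ManifoldCF code. */
class VersionCheck {
    /**
     * Returns true when the document must be sent again: either nothing was
     * recorded for it, or the document version or the output version changed.
     */
    static boolean needsIngest(String recordedDocVersion, String recordedOutputVersion,
                               String currentDocVersion, String currentOutputVersion) {
        if (recordedDocVersion == null) {
            return true; // never recorded for this output connection
        }
        return !recordedDocVersion.equals(currentDocVersion)
            || !Objects.equals(recordedOutputVersion, currentOutputVersion);
    }
}
```

Under this reading, a caller that gets `false` would only note the check time (the role of documentCheck), while `true` would lead to a full documentIngest call.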
protected void updateRowIds(java.util.ArrayList list,
java.lang.String queryPart,
long checkTime)
throws ManifoldCFException
ManifoldCFException
public void documentDeleteMultiple(java.lang.String[] outputConnectionNames,
java.lang.String[] identifierClasses,
java.lang.String[] identifierHashes,
IOutputRemoveActivity activities)
throws ManifoldCFException,
ServiceInterruption
documentDeleteMultiple in interface IIncrementalIngester
outputConnectionNames - are the names of the output connections associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes of the documents.
activities - is the object to use to log the details of the ingestion attempt. May be null.
ManifoldCFException
ServiceInterruption
public void documentDeleteMultiple(java.lang.String outputConnectionName,
java.lang.String[] identifierClasses,
java.lang.String[] identifierHashes,
IOutputRemoveActivity activities)
throws ManifoldCFException,
ServiceInterruption
documentDeleteMultiple in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes of the documents.
activities - is the object to use to log the details of the ingestion attempt. May be null.
ManifoldCFException
ServiceInterruption
protected void findRowIdsForURIs(java.lang.String outputConnectionName,
java.util.HashMap rowIDSet,
java.util.HashMap uris,
java.util.ArrayList hashParamValues,
java.lang.String paramList)
throws ManifoldCFException
ManifoldCFException
protected void findRowIdsForDocIds(java.lang.String outputConnectionName,
java.util.HashMap rowIDSet,
java.util.ArrayList paramValues,
java.lang.String paramList)
throws ManifoldCFException
ManifoldCFException
protected void deleteRowIds(java.util.ArrayList list,
java.lang.String queryPart)
throws ManifoldCFException
ManifoldCFException
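The `list` and `queryPart` parameters of deleteRowIds and updateRowIds suggest that row ids are processed in bounded batches, each with its own query fragment and parameter list. A minimal sketch of that chunking follows, assuming an `id` column and a caller-supplied limit; `RowIdChunker`, `Chunk`, and `maxInClause` are hypothetical names chosen for illustration (the inherited getMaxInClause presumably supplies such a limit in the real class).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chunking helper; maxInClause stands in for a database-specific limit. */
class RowIdChunker {
    /** One batch: a query fragment such as "id IN (?,?,?)" plus its parameter values. */
    static final class Chunk {
        final String queryPart;
        final List<Long> params;

        Chunk(String queryPart, List<Long> params) {
            this.queryPart = queryPart;
            this.params = params;
        }
    }

    /** Splits rowIds into batches of at most maxInClause ids each. */
    static List<Chunk> chunk(List<Long> rowIds, int maxInClause) {
        if (maxInClause < 1) {
            throw new IllegalArgumentException("maxInClause must be >= 1");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (int start = 0; start < rowIds.size(); start += maxInClause) {
            int end = Math.min(start + maxInClause, rowIds.size());
            List<Long> params = new ArrayList<>(rowIds.subList(start, end));
            // Build one placeholder per id; the column name "id" is an assumption.
            StringBuilder sb = new StringBuilder("id IN (");
            for (int i = 0; i < params.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append('?');
            }
            sb.append(')');
            chunks.add(new Chunk(sb.toString(), params));
        }
        return chunks;
    }
}
```

Each resulting chunk could then be handed to a delete or update statement on its own, which keeps any single statement within the database's IN-clause limit.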
public void documentDelete(java.lang.String outputConnectionName,
java.lang.String identifierClass,
java.lang.String identifierHash,
IOutputRemoveActivity activities)
throws ManifoldCFException,
ServiceInterruption
documentDelete in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
activities - is the object to use to log the details of the ingestion attempt. May be null.
ManifoldCFException
ServiceInterruption
protected IncrementalIngester.DeleteInfo[] getDocumentURIMultiple(java.lang.String outputConnectionName,
java.lang.String[] identifierClasses,
java.lang.String[] identifierHashes)
throws ManifoldCFException
identifierHashes - is the array of document identifier hashes to check.
ManifoldCFException
public DocumentIngestStatus[] getDocumentIngestDataMultiple(java.lang.String[] outputConnectionNames,
java.lang.String[] identifierClasses,
java.lang.String[] identifierHashes)
throws ManifoldCFException
getDocumentIngestDataMultiple in interface IIncrementalIngester
outputConnectionNames - are the names of the output connections associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes to look up.
ManifoldCFException
public DocumentIngestStatus[] getDocumentIngestDataMultiple(java.lang.String outputConnectionName,
java.lang.String[] identifierClasses,
java.lang.String[] identifierHashes)
throws ManifoldCFException
getDocumentIngestDataMultiple in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - is the array of document identifier hashes to look up.
ManifoldCFException
public DocumentIngestStatus getDocumentIngestData(java.lang.String outputConnectionName,
java.lang.String identifierClass,
java.lang.String identifierHash)
throws ManifoldCFException
getDocumentIngestData in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
ManifoldCFException
public long getDocumentUpdateInterval(java.lang.String outputConnectionName,
java.lang.String identifierClass,
java.lang.String identifierHash)
throws ManifoldCFException
getDocumentUpdateInterval in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClass - is the name of the space in which the identifier hash should be interpreted.
identifierHash - is the hash of the id of the document.
ManifoldCFException
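This page does not state how the average interval is computed. One plausible reading, based only on the field names firstIngestField, lastIngestField, and changeCountField, is the elapsed time divided by the number of changes. The sketch below shows that arithmetic; `UpdateInterval` and `averageInterval` are hypothetical names, and returning 0 for missing history is an assumption.

```java
/** Hypothetical arithmetic sketch; the formula is an assumption based on the field names. */
class UpdateInterval {
    /**
     * Average milliseconds between changes, assuming changeCount counts the
     * changes seen between firstIngest and lastIngest. Returns 0 when there
     * is no usable history.
     */
    static long averageInterval(long firstIngest, long lastIngest, long changeCount) {
        if (changeCount <= 0 || lastIngest <= firstIngest) {
            return 0L;
        }
        return (lastIngest - firstIngest) / changeCount;
    }
}
```

For example, a document first ingested at time 1000, last ingested at time 7000, with 3 recorded changes, would get an average interval of 2000 milliseconds under this assumption.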
public long[] getDocumentUpdateIntervalMultiple(java.lang.String outputConnectionName,
java.lang.String[] identifierClasses,
java.lang.String[] identifierHashes)
throws ManifoldCFException
getDocumentUpdateIntervalMultiple in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
identifierClasses - are the names of the spaces in which the identifier hashes should be interpreted.
identifierHashes - are the hashes of the ids of the documents.
ManifoldCFException
protected void getIntervals(long[] rval,
java.lang.String outputConnectionName,
java.util.ArrayList list,
java.lang.String queryPart,
java.util.HashMap returnMap)
throws ManifoldCFException
rval - is the array to stuff calculated return values into.
list - is the list of parameters.
queryPart - is the part of the query pertaining to the list of hashcodes.
returnMap - is a mapping from document id to rval index.
ManifoldCFException
public void resetOutputConnection(java.lang.String outputConnectionName)
throws ManifoldCFException
resetOutputConnection in interface IIncrementalIngester
outputConnectionName - is the name of the output connection associated with this action.
ManifoldCFException
protected void noteDocumentIngest(java.lang.String outputConnectionName,
java.lang.String docKey,
java.lang.String documentVersion,
java.lang.String outputVersion,
java.lang.String authorityNameString,
long ingestTime,
java.lang.String documentURI,
java.lang.String documentURIHash)
throws ManifoldCFException
outputConnectionName - is the name of the output connection.
docKey - is the key string describing the document.
documentVersion - is a string describing the new version of the document.
outputVersion - is the version string calculated for the output connection.
authorityNameString - is the name of the relevant authority connection.
ingestTime - is the time at which the ingestion took place, in milliseconds since epoch.
documentURI - is the URI the document can be accessed at, or null (which signals that we are to record the version, but no ingestion took place).
documentURIHash - is the hash of the document URI.
ManifoldCFException
protected void getDocumentURIChunk(IncrementalIngester.DeleteInfo[] rval,
java.util.Map map,
java.lang.String outputConnectionName,
java.lang.String clause,
java.util.ArrayList list)
throws ManifoldCFException
rval - is the DeleteInfo array where the URIs should be put.
map - is the map from id to index.
clause - is the in clause for the query.
list - is the parameter list for the query.
ManifoldCFException
protected void getDocumentIngestDataChunk(DocumentIngestStatus[] rval,
java.util.Map map,
java.lang.String outputConnectionName,
java.lang.String clause,
java.util.ArrayList list)
throws ManifoldCFException
rval - is the document ingest status array where the data should be put.
map - is the map from id to index.
clause - is the in clause for the query.
list - is the parameter list for the query.
ManifoldCFException
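The `rval` and `map` parameters of the chunk methods describe a positional fill: query rows may come back in any order, and the map from id to index places each one in the slot of the document that was asked about. A minimal sketch of that pattern follows; `PositionalFill`, `Row`, `indexByKey`, and `fill` are hypothetical names, and plain version strings stand in for DocumentIngestStatus objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of filling a result array by key-to-index lookup. */
class PositionalFill {
    /** Hypothetical query row: a document key plus the version stored for it. */
    static final class Row {
        final String docKey;
        final String version;

        Row(String docKey, String version) {
            this.docKey = docKey;
            this.version = version;
        }
    }

    /** Builds the key-to-index map for the requested documents. */
    static Map<String, Integer> indexByKey(String[] docKeys) {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < docKeys.length; i++) {
            map.put(docKeys[i], i);
        }
        return map;
    }

    /** Copies each row into the slot of the matching request; unmatched slots stay null. */
    static void fill(String[] rval, Map<String, Integer> map, List<Row> rows) {
        for (Row row : rows) {
            Integer index = map.get(row.docKey);
            if (index != null) {
                rval[index] = row.version;
            }
        }
    }
}
```

With this layout, a null slot in the result array naturally means "no record found for that document", without needing a second pass over the inputs.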
protected int addOrReplaceDocument(IOutputConnection connection,
java.lang.String documentURI,
java.lang.String outputDescription,
RepositoryDocument document,
java.lang.String authorityNameString,
IOutputAddActivity activities)
throws ManifoldCFException,
ServiceInterruption
ManifoldCFException
ServiceInterruption
protected void removeDocument(IOutputConnection connection,
java.lang.String documentURI,
java.lang.String outputDescription,
IOutputRemoveActivity activities)
throws ManifoldCFException,
ServiceInterruption
ManifoldCFException
ServiceInterruption
protected static java.lang.String makeKey(java.lang.String documentClass,
java.lang.String documentHash)
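The key format produced by makeKey is not documented on this page. As an illustration only, a composite key could be formed by joining the two parts with a separator; the `DocKeys` class name and the `:` separator below are assumptions.

```java
/** Hypothetical sketch; the real makeKey() may use a different separator or order. */
class DocKeys {
    /** Joins the identifier class and the identifier hash into one key string. */
    static String makeKey(String documentClass, String documentHash) {
        return documentClass + ":" + documentHash;
    }
}
```

Including the document class in the key means that two documents with the same hash in different identifier spaces are tracked as separate entries.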