org.apache.manifoldcf.crawler.system
Class WorkerThread.ProcessActivity

java.lang.Object
  extended by org.apache.manifoldcf.crawler.system.WorkerThread.ProcessActivity
All Implemented Interfaces:
IAbortActivity, IEventActivity, IFingerprintActivity, IHistoryActivity, INamingActivity, IProcessActivity
Enclosing class:
WorkerThread

protected static class WorkerThread.ProcessActivity
extends java.lang.Object
implements IProcessActivity

The process activity class wraps access to the ingester and the job queue.


Field Summary
protected  java.util.HashMap abortSet
           
protected  IRepositoryConnection connection
           
protected  IRepositoryConnector connector
           
protected  IRepositoryConnectionManager connMgr
           
protected  long currentTime
           
protected  boolean ingestAllowed
           
protected  IIncrementalIngester ingester
           
protected  WorkerThread.OutputActivity ingestLogger
           
protected  IJobDescription job
           
protected  IJobManager jobManager
           
protected  java.lang.String[] legalLinkTypes
           
protected  java.util.HashMap lowerExpireBounds
           
protected  java.util.HashMap lowerRescheduleBounds
           
protected  java.util.HashMap originationTimes
           
protected  java.lang.String outputVersion
           
protected  QueueTracker queueTracker
           
protected  java.util.HashMap referenceList
           
protected  IThreadContext threadContext
           
protected  java.util.HashMap upperExpireBounds
           
protected  java.util.HashMap upperRescheduleBounds
           
 
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IProcessActivity
_rcsid
 
Constructor Summary
WorkerThread.ProcessActivity(IThreadContext threadContext, QueueTracker queueTracker, IJobManager jobManager, IIncrementalIngester ingester, long currentTime, IJobDescription job, IRepositoryConnection connection, IRepositoryConnector connector, IRepositoryConnectionManager connMgr, java.lang.String[] legalLinkTypes, WorkerThread.OutputActivity ingestLogger, java.util.HashMap abortSet, java.lang.String outputVersion)
          Constructor.
 
Method Summary
 void addDocumentReference(java.lang.String localIdentifier)
          Add a document description to the current job's queue.
 void addDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType)
          Add a document description to the current job's queue.
 void addDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues)
          Add a document description to the current job's queue.
 void addDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime)
          Add a document description to the current job's queue.
 void addDocumentReference(java.lang.String localIdentifier, java.lang.String parentIdentifier, java.lang.String relationshipType, java.lang.String[] dataNames, java.lang.Object[][] dataValues, java.lang.Long originationTime, java.lang.String[] prereqEventNames)
          Add a document description to the current job's queue.
 boolean beginEventSequence(java.lang.String eventName)
          Begin an event sequence.
 java.lang.Long calculateDocumentExpireTime(long currentTime, java.lang.String localIdentifier)
           
 java.lang.Long calculateDocumentRescheduleTime(long currentTime, long timeAmt, java.lang.String localIdentifier)
           
 boolean checkDocumentIndexable(java.io.File localFile)
          Check whether a document is indexable by the currently specified output connector.
 void checkJobStillActive()
          Check whether current job is still active.
 boolean checkMimeTypeIndexable(java.lang.String mimeType)
          Check whether a mime type is indexable by the currently specified output connector.
 void completeEventSequence(java.lang.String eventName)
          Complete an event sequence.
 java.lang.String createConnectionSpecificString(java.lang.String simpleString)
          Create a connection-specific string from a simple string.
 java.lang.String createGlobalString(java.lang.String simpleString)
          Create a global string from a simple string.
 java.lang.String createJobSpecificString(java.lang.String simpleString)
          Create a job-specific string from a simple string.
 void deleteDocument(java.lang.String documentIdentifier)
          Delete the current document from the search engine index.
 void discard()
          Clean up any dangling information before abandoning this process activity object.
 void flush()
          Flush the outstanding references into the database.
 java.lang.Long getDocumentExpirationLowerBoundTime(java.lang.String localIdentifier)
          Find a document's lower expiration time bound, if any.
 java.lang.Long getDocumentExpirationUpperBoundTime(java.lang.String localIdentifier)
          Find a document's upper expiration time bound, if any.
 java.lang.Long getDocumentOriginationTime(java.lang.String localIdentifier)
          Get a document's origination time.
 java.lang.Long getDocumentRescheduleLowerBoundTime(java.lang.String localIdentifier)
          Find a document's lower rescheduling time bound, if any.
 java.lang.Long getDocumentRescheduleUpperBoundTime(java.lang.String localIdentifier)
          Find a document's upper rescheduling time bound, if any.
 void ingestDocument(java.lang.String documentIdentifier, java.lang.String version, java.lang.String documentURI, RepositoryDocument data)
          Ingest the current document.
protected  void processDocumentReferences()
          Process outstanding document references, in batch.
 void recordActivity(java.lang.Long startTime, java.lang.String activityType, java.lang.Long dataSize, java.lang.String entityIdentifier, java.lang.String resultCode, java.lang.String resultDescription, java.lang.String[] childIdentifiers)
          Record time-stamped information about the activity of the connector.
 void recordDocument(java.lang.String documentIdentifier, java.lang.String version)
          Record a document version, but don't ingest it.
 void resetTimes()
          Reset the recorded times.
 java.lang.String[] retrieveParentData(java.lang.String localIdentifier, java.lang.String dataName)
          Retrieve data passed from parents to a specified child document.
 CharacterInput[] retrieveParentDataAsFiles(java.lang.String localIdentifier, java.lang.String dataName)
          Retrieve data passed from parents to a specified child document.
 void retryDocumentProcessing(java.lang.String localIdentifier)
          Abort processing a document (for sequencing reasons).
 void setDocumentOriginationTime(java.lang.String localIdentifier, java.lang.Long originationTime)
          Override a document's origination time.
 void setDocumentScheduleBounds(java.lang.String localIdentifier, java.lang.Long lowerRecrawlBoundTime, java.lang.Long upperRecrawlBoundTime, java.lang.Long lowerExpireBoundTime, java.lang.Long upperExpireBoundTime)
          Override the schedule for the next time a document is crawled.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

threadContext

protected IThreadContext threadContext

jobManager

protected IJobManager jobManager

ingester

protected IIncrementalIngester ingester

ingestAllowed

protected boolean ingestAllowed

currentTime

protected long currentTime

job

protected IJobDescription job

connection

protected IRepositoryConnection connection

connector

protected IRepositoryConnector connector

connMgr

protected IRepositoryConnectionManager connMgr

legalLinkTypes

protected java.lang.String[] legalLinkTypes

ingestLogger

protected WorkerThread.OutputActivity ingestLogger

queueTracker

protected QueueTracker queueTracker

abortSet

protected java.util.HashMap abortSet

outputVersion

protected java.lang.String outputVersion

referenceList

protected java.util.HashMap referenceList

lowerRescheduleBounds

protected java.util.HashMap lowerRescheduleBounds

upperRescheduleBounds

protected java.util.HashMap upperRescheduleBounds

lowerExpireBounds

protected java.util.HashMap lowerExpireBounds

upperExpireBounds

protected java.util.HashMap upperExpireBounds

originationTimes

protected java.util.HashMap originationTimes
Constructor Detail

WorkerThread.ProcessActivity

public WorkerThread.ProcessActivity(IThreadContext threadContext,
                                    QueueTracker queueTracker,
                                    IJobManager jobManager,
                                    IIncrementalIngester ingester,
                                    long currentTime,
                                    IJobDescription job,
                                    IRepositoryConnection connection,
                                    IRepositoryConnector connector,
                                    IRepositoryConnectionManager connMgr,
                                    java.lang.String[] legalLinkTypes,
                                    WorkerThread.OutputActivity ingestLogger,
                                    java.util.HashMap abortSet,
                                    java.lang.String outputVersion)
Constructor.

Parameters:
jobManager - is the job manager
ingester - is the ingester
Method Detail

discard

public void discard()
             throws ManifoldCFException
Clean up any dangling information before abandoning this process activity object.

Throws:
ManifoldCFException

addDocumentReference

public void addDocumentReference(java.lang.String localIdentifier,
                                 java.lang.String parentIdentifier,
                                 java.lang.String relationshipType,
                                 java.lang.String[] dataNames,
                                 java.lang.Object[][] dataValues,
                                 java.lang.Long originationTime,
                                 java.lang.String[] prereqEventNames)
                          throws ManifoldCFException
Add a document description to the current job's queue.

Specified by:
addDocumentReference in interface IProcessActivity
Parameters:
localIdentifier - is the local document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null if no hopcount filtering is desired for this kind of relationship.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters.
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null. Each object must be either a String or a CharacterInput.
originationTime - is the time, in ms since epoch, that the document originated. Pass null if none or unknown.
prereqEventNames - are the names of the prerequisite events which this document requires prior to processing. Pass null if none.
Throws:
ManifoldCFException
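The constraints on dataNames and dataValues can be checked before a reference is queued. The following is a minimal sketch only: CarryDownCheck and its validate method are hypothetical names that are not part of ManifoldCF, and the rule that both arrays have the same length is an assumption drawn from the statement that the values correspond to the names. Element types are not checked.

```java
// Hypothetical helper, not part of ManifoldCF: checks carry-down arguments
// against the constraints documented for addDocumentReference.
final class CarryDownCheck {
    static final int MAX_NAME_LENGTH = 255;

    /** Returns null if the arguments look valid, otherwise a description of the problem. */
    static String validate(String[] dataNames, Object[][] dataValues) {
        if (dataNames == null) {
            // No carry-down data at all; nothing further to check.
            return null;
        }
        if (dataValues == null) {
            return "dataValues may be null only if dataNames is null";
        }
        if (dataNames.length != dataValues.length) {
            // Assumption: one row of values per data name.
            return "dataNames and dataValues differ in length";
        }
        for (String name : dataNames) {
            if (name == null || name.length() > MAX_NAME_LENGTH) {
                return "each data name must be non-null and at most 255 characters";
            }
        }
        return null;
    }
}
```

A connector could run such a check on its own arguments and report the returned message instead of passing malformed arrays on.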

addDocumentReference

public void addDocumentReference(java.lang.String localIdentifier,
                                 java.lang.String parentIdentifier,
                                 java.lang.String relationshipType,
                                 java.lang.String[] dataNames,
                                 java.lang.Object[][] dataValues,
                                 java.lang.Long originationTime)
                          throws ManifoldCFException
Add a document description to the current job's queue.

Specified by:
addDocumentReference in interface IProcessActivity
Parameters:
localIdentifier - is the local document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null if no hopcount filtering is desired for this kind of relationship.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters.
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null.
originationTime - is the time, in ms since epoch, that the document originated. Pass null if none or unknown.
Throws:
ManifoldCFException

addDocumentReference

public void addDocumentReference(java.lang.String localIdentifier,
                                 java.lang.String parentIdentifier,
                                 java.lang.String relationshipType,
                                 java.lang.String[] dataNames,
                                 java.lang.Object[][] dataValues)
                          throws ManifoldCFException
Add a document description to the current job's queue.

Specified by:
addDocumentReference in interface IProcessActivity
Parameters:
localIdentifier - is the local document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null if no hopcount filtering is desired for this kind of relationship.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
dataNames - is the list of carry-down data from the parent to the child. May be null. Each name is limited to 255 characters.
dataValues - are the values that correspond to the data names in the dataNames parameter. May be null only if dataNames is null.
Throws:
ManifoldCFException

addDocumentReference

public void addDocumentReference(java.lang.String localIdentifier,
                                 java.lang.String parentIdentifier,
                                 java.lang.String relationshipType)
                          throws ManifoldCFException
Add a document description to the current job's queue.

Specified by:
addDocumentReference in interface IProcessActivity
Parameters:
localIdentifier - is the local document identifier to add (for the connector that fetched the document).
parentIdentifier - is the document identifier that is considered to be the "parent" of this identifier. May be null if no hopcount filtering is desired for this kind of relationship.
relationshipType - is the string describing the kind of relationship described by this reference. This must be one of the strings returned by the IRepositoryConnector method "getRelationshipTypes()". May be null.
Throws:
ManifoldCFException

addDocumentReference

public void addDocumentReference(java.lang.String localIdentifier)
                          throws ManifoldCFException
Add a document description to the current job's queue. This method is equivalent to addDocumentReference(localIdentifier, null, null).

Specified by:
addDocumentReference in interface IProcessActivity
Parameters:
localIdentifier - is the local document identifier to add (for the connector that fetched the document).
Throws:
ManifoldCFException
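The equivalence stated for the single-argument form can be pictured as plain overload delegation. This sketch uses a hypothetical stand-in class, ReferenceQueueSketch, that records calls in a list; it does not reproduce the actual ManifoldCF implementation or its queueing behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not the ManifoldCF class: the shorter overload
// delegates to the three-argument form, passing null for the omitted arguments.
final class ReferenceQueueSketch {
    final List<String> queued = new ArrayList<>();

    void addDocumentReference(String localIdentifier) {
        // Equivalent to addDocumentReference(localIdentifier, null, null).
        addDocumentReference(localIdentifier, null, null);
    }

    void addDocumentReference(String localIdentifier, String parentIdentifier,
                              String relationshipType) {
        // Record the call so the two forms can be compared.
        queued.add(localIdentifier + "|" + parentIdentifier + "|" + relationshipType);
    }
}
```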

retrieveParentData

public java.lang.String[] retrieveParentData(java.lang.String localIdentifier,
                                             java.lang.String dataName)
                                      throws ManifoldCFException
Retrieve data passed from parents to a specified child document.

Specified by:
retrieveParentData in interface IProcessActivity
Parameters:
localIdentifier - is the document identifier of the document we want the recorded data for.
dataName - is the name of the data items to retrieve.
Returns:
an array containing the unique data values passed from ALL parents. Note that these are in no particular order, and there will not be any duplicates.
Throws:
ManifoldCFException
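The return contract (unique values gathered from all parents, in no particular order) behaves like a set union. The sketch below illustrates that contract only; ParentDataSketch and the map of values keyed by parent are hypothetical and say nothing about how the framework stores carry-down data.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the documented return contract.
final class ParentDataSketch {
    /** Merges the values supplied by every parent, dropping duplicates. */
    static String[] uniqueValues(Map<String, List<String>> valuesByParent) {
        Set<String> unique = new HashSet<>();
        for (List<String> values : valuesByParent.values()) {
            unique.addAll(values);
        }
        // A HashSet gives no ordering guarantee, matching "no particular order".
        return unique.toArray(new String[0]);
    }
}
```

Because the order is unspecified, callers that need a stable order should sort the returned array themselves.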

retrieveParentDataAsFiles

public CharacterInput[] retrieveParentDataAsFiles(java.lang.String localIdentifier,
                                                  java.lang.String dataName)
                                           throws ManifoldCFException
Retrieve data passed from parents to a specified child document.

Specified by:
retrieveParentDataAsFiles in interface IProcessActivity
Parameters:
localIdentifier - is the document identifier of the document we want the recorded data for.
dataName - is the name of the data items to retrieve.
Returns:
an array containing the unique data values passed from ALL parents. Note that these are in no particular order, and there will not be any duplicates.
Throws:
ManifoldCFException

recordDocument

public void recordDocument(java.lang.String documentIdentifier,
                           java.lang.String version)
                    throws ManifoldCFException,
                           ServiceInterruption
Record a document version, but don't ingest it. ServiceInterruption is thrown if this action must be rescheduled.

Specified by:
recordDocument in interface IProcessActivity
Parameters:
documentIdentifier - is the document identifier.
version - is the document version.
Throws:
ManifoldCFException
ServiceInterruption

ingestDocument

public void ingestDocument(java.lang.String documentIdentifier,
                           java.lang.String version,
                           java.lang.String documentURI,
                           RepositoryDocument data)
                    throws ManifoldCFException,
                           ServiceInterruption
Ingest the current document.

Specified by:
ingestDocument in interface IProcessActivity
Parameters:
documentIdentifier - is the document's local identifier.
version - is the version of the document, as reported by the getDocumentVersions() method of the corresponding repository connector.
documentURI - is the URI to use to retrieve this document from the search interface (and is also the unique key in the index).
data - is the document data. The data is closed after ingestion is complete.
Throws:
ManifoldCFException
ServiceInterruption

deleteDocument

public void deleteDocument(java.lang.String documentIdentifier)
                    throws ManifoldCFException,
                           ServiceInterruption
Delete the current document from the search engine index.

Specified by:
deleteDocument in interface IProcessActivity
Parameters:
documentIdentifier - is the document's local identifier.
Throws:
ManifoldCFException
ServiceInterruption

setDocumentScheduleBounds

public void setDocumentScheduleBounds(java.lang.String localIdentifier,
                                      java.lang.Long lowerRecrawlBoundTime,
                                      java.lang.Long upperRecrawlBoundTime,
                                      java.lang.Long lowerExpireBoundTime,
                                      java.lang.Long upperExpireBoundTime)
                               throws ManifoldCFException
Override the schedule for the next time a document is crawled. Calling this method allows you to set an upper recrawl bound, lower recrawl bound, upper expire bound, lower expire bound, or a combination of these, on a specific document. This method has an effect only if the job is continuous and the identifier you pass in is currently being processed.

Specified by:
setDocumentScheduleBounds in interface IProcessActivity
Parameters:
localIdentifier - is the document's local identifier.
lowerRecrawlBoundTime - is the time in ms since epoch that the reschedule time should not fall BELOW, or null if none.
upperRecrawlBoundTime - is the time in ms since epoch that the reschedule time should not rise ABOVE, or null if none.
lowerExpireBoundTime - is the time in ms since epoch that the expire time should not fall BELOW, or null if none.
upperExpireBoundTime - is the time in ms since epoch that the expire time should not rise ABOVE, or null if none.
Throws:
ManifoldCFException
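The meaning of "should not fall BELOW" and "should not rise ABOVE" can be shown as a clamp with optional bounds. This is an illustrative sketch, not the framework's scheduling code; ScheduleBoundsSketch is a hypothetical name, and letting the upper bound win when the bounds conflict is an assumption.

```java
// Hypothetical illustration of applying optional lower and upper time bounds.
final class ScheduleBoundsSketch {
    /** Keeps a proposed time (ms since epoch) within the bounds; null means no bound. */
    static long clamp(long proposedTime, Long lowerBound, Long upperBound) {
        long result = proposedTime;
        if (lowerBound != null && result < lowerBound) {
            result = lowerBound;
        }
        // Assumption: if the bounds conflict, the upper bound takes precedence.
        if (upperBound != null && result > upperBound) {
            result = upperBound;
        }
        return result;
    }
}
```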

setDocumentOriginationTime

public void setDocumentOriginationTime(java.lang.String localIdentifier,
                                       java.lang.Long originationTime)
                                throws ManifoldCFException
Override a document's origination time. Use this method to signal the framework that a document's origination time is something other than the first time it was crawled.

Specified by:
setDocumentOriginationTime in interface IProcessActivity
Parameters:
localIdentifier - is the document's local identifier.
originationTime - is the document's origination time, or null if unknown.
Throws:
ManifoldCFException

getDocumentRescheduleLowerBoundTime

public java.lang.Long getDocumentRescheduleLowerBoundTime(java.lang.String localIdentifier)
Find a document's lower rescheduling time bound, if any.


getDocumentRescheduleUpperBoundTime

public java.lang.Long getDocumentRescheduleUpperBoundTime(java.lang.String localIdentifier)
Find a document's upper rescheduling time bound, if any.


getDocumentExpirationLowerBoundTime

public java.lang.Long getDocumentExpirationLowerBoundTime(java.lang.String localIdentifier)
Find a document's lower expiration time bound, if any.


getDocumentExpirationUpperBoundTime

public java.lang.Long getDocumentExpirationUpperBoundTime(java.lang.String localIdentifier)
Find a document's upper expiration time bound, if any.


getDocumentOriginationTime

public java.lang.Long getDocumentOriginationTime(java.lang.String localIdentifier)
Get a document's origination time.


calculateDocumentRescheduleTime

public java.lang.Long calculateDocumentRescheduleTime(long currentTime,
                                                      long timeAmt,
                                                      java.lang.String localIdentifier)

calculateDocumentExpireTime

public java.lang.Long calculateDocumentExpireTime(long currentTime,
                                                  java.lang.String localIdentifier)

resetTimes

public void resetTimes()
Reset the recorded times.


recordActivity

public void recordActivity(java.lang.Long startTime,
                           java.lang.String activityType,
                           java.lang.Long dataSize,
                           java.lang.String entityIdentifier,
                           java.lang.String resultCode,
                           java.lang.String resultDescription,
                           java.lang.String[] childIdentifiers)
                    throws ManifoldCFException
Record time-stamped information about the activity of the connector.

Specified by:
recordActivity in interface IHistoryActivity
Parameters:
startTime - is either null or the time, in milliseconds since the epoch (Jan 1, 1970), at which the activity began. Every activity has an associated time; the startTime field records when the activity began. A null value indicates that the start time and the finishing time are the same.
activityType - is a string, fully interpretable only in the context of the connector involved, that categorizes the kind of activity being recorded. For example, a web connector might record a "fetch document" activity. Cannot be null.
dataSize - is the number of bytes of data involved in the activity, or null if not applicable.
entityIdentifier - is a (possibly long) string which identifies the object involved in the history record. The interpretation of this field will differ from connector to connector. May be null.
resultCode - contains a terse description of the result of the activity. The description is limited in size to 255 characters, and can be interpreted only in the context of the current connector. May be null.
resultDescription - is a (possibly long) human-readable string which adds detail, if required, to the result described in the resultCode field. This field is not meant to be queried on. May be null.
childIdentifiers - is a set of child entity identifiers associated with this activity. May be null.
Throws:
ManifoldCFException
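The treatment of a null startTime can be made concrete with a small calculation. The sketch below is hypothetical (ActivityTimingSketch is not a ManifoldCF class) and assumes the finishing time is supplied by the caller; it shows only that a null start time yields a zero-length activity.

```java
// Hypothetical illustration of the startTime convention described above.
final class ActivityTimingSketch {
    /** Returns the activity duration in ms; a null startTime means start == finish. */
    static long elapsedMillis(Long startTime, long finishTime) {
        long start = (startTime == null) ? finishTime : startTime;
        return finishTime - start;
    }
}
```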

flush

public void flush()
           throws ManifoldCFException
Flush the outstanding references into the database.

Throws:
ManifoldCFException

processDocumentReferences

protected void processDocumentReferences()
                                  throws ManifoldCFException
Process outstanding document references, in batch.

Throws:
ManifoldCFException

checkJobStillActive

public void checkJobStillActive()
                         throws ManifoldCFException,
                                ServiceInterruption
Check whether the current job is still active. This method is provided to allow an individual connector that needs to wait on some long-term condition to give up waiting when the job itself is aborted. If the connector should abort, this method will raise a properly-formed ServiceInterruption, which, if thrown to the caller, will signal that the current processing activity remains incomplete and must be retried when the job is resumed.

Specified by:
checkJobStillActive in interface IAbortActivity
Throws:
ManifoldCFException
ServiceInterruption
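A connector that waits on a long-term condition would typically call the check inside its wait loop and let any exception propagate. The following sketch assumes a hypothetical AbortCheck interface standing in for IAbortActivity (declared here with a generic Exception rather than ManifoldCFException and ServiceInterruption), and a caller-supplied readiness test and poll interval.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of an abortable wait loop; not ManifoldCF code.
final class AbortableWaitSketch {
    /** Stand-in for the single IAbortActivity method used here. */
    interface AbortCheck {
        void checkJobStillActive() throws Exception;
    }

    /**
     * Polls until ready reports true. The abort check runs before each sleep, so
     * an exception from it ends the wait early and propagates to the caller.
     * Returns the number of times the loop had to wait.
     */
    static int waitUntilReady(BooleanSupplier ready, AbortCheck abortCheck, long pollMillis)
            throws Exception {
        int waits = 0;
        while (!ready.getAsBoolean()) {
            abortCheck.checkJobStillActive();
            Thread.sleep(pollMillis);
            waits++;
        }
        return waits;
    }
}
```

The important design point is that the exception is not caught inside the loop, so the caller sees that processing is incomplete.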

beginEventSequence

public boolean beginEventSequence(java.lang.String eventName)
                           throws ManifoldCFException
Begin an event sequence. This method should be called by a connector when a sequencing event should enter the "pending" state. If the event is already in that state, this method returns false; otherwise it returns true. The connector is responsible for managing sequencing appropriately, given the response status.

Specified by:
beginEventSequence in interface IEventActivity
Parameters:
eventName - is the event name.
Returns:
false if the event is already in the "pending" state, true otherwise.
Throws:
ManifoldCFException

completeEventSequence

public void completeEventSequence(java.lang.String eventName)
                           throws ManifoldCFException
Complete an event sequence. This method should be called to signal that an event is no longer in the "pending" state. This can mean that the prerequisite processing is completed, but it can also mean that prerequisite processing was aborted or cannot be completed. Note well: This method should not be called unless the connector is CERTAIN that an event is in progress, and that the current thread has the sole right to complete it. Otherwise, race conditions can develop which would be difficult to diagnose.

Specified by:
completeEventSequence in interface IEventActivity
Parameters:
eventName - is the event name.
Throws:
ManifoldCFException

retryDocumentProcessing

public void retryDocumentProcessing(java.lang.String localIdentifier)
                             throws ManifoldCFException
Abort processing a document (for sequencing reasons). This method should be called in order to cause the specified document to be requeued for later processing. While this is similar in some respects to the semantics of a ServiceInterruption, it applies to only one document at a time and does not specify any delay period, since the requeue is presumed to be caused by sequencing issues synchronized around an underlying event.

Specified by:
retryDocumentProcessing in interface IEventActivity
Parameters:
localIdentifier - is the document identifier to requeue.
Throws:
ManifoldCFException
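The three IEventActivity methods are meant to be used together: one caller wins beginEventSequence, performs the prerequisite work, and completes the event, while other callers requeue their document. The sketch below models that protocol with an in-memory set; EventSequenceSketch and its process method are hypothetical, and the "RETRY" result merely stands in for a call to retryDocumentProcessing.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory model of the event sequencing protocol; not ManifoldCF code.
final class EventSequenceSketch {
    private final Set<String> pending = new HashSet<>();

    /** Returns true if this call moved the event into the pending state. */
    synchronized boolean beginEventSequence(String eventName) {
        return pending.add(eventName);
    }

    /** Takes the event out of the pending state. */
    synchronized void completeEventSequence(String eventName) {
        pending.remove(eventName);
    }

    /**
     * The caller that wins the begin call does the prerequisite work and always
     * completes the event, even on failure. Any other caller gets "RETRY" and
     * must not complete an event it does not own.
     */
    String process(String eventName, Runnable prerequisiteWork) {
        if (!beginEventSequence(eventName)) {
            return "RETRY";
        }
        try {
            prerequisiteWork.run();
            return "DONE";
        } finally {
            completeEventSequence(eventName);
        }
    }
}
```

Completing the event in a finally block reflects the note that completion can also mean the prerequisite processing was aborted.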

checkMimeTypeIndexable

public boolean checkMimeTypeIndexable(java.lang.String mimeType)
                               throws ManifoldCFException,
                                      ServiceInterruption
Check whether a mime type is indexable by the currently specified output connector.

Specified by:
checkMimeTypeIndexable in interface IFingerprintActivity
Parameters:
mimeType - is the mime type to check, not including any character set specification.
Returns:
true if the mime type is indexable.
Throws:
ManifoldCFException
ServiceInterruption
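Since the mimeType argument must not include a character set specification, a value taken from a Content-Type header needs its parameters removed first. This is a minimal sketch with a hypothetical helper, MimeTypeSketch.stripParameters; it assumes a non-null header value in which parameters follow the first semicolon.

```java
// Hypothetical helper, not part of ManifoldCF.
final class MimeTypeSketch {
    /** Returns the bare mime type from a Content-Type value, dropping "; charset=..." and similar. */
    static String stripParameters(String contentType) {
        int semicolon = contentType.indexOf(';');
        String bare = (semicolon >= 0) ? contentType.substring(0, semicolon) : contentType;
        return bare.trim();
    }
}
```

The result, for example "text/html" for the header value "text/html; charset=UTF-8", is the form that would then be passed to checkMimeTypeIndexable.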

checkDocumentIndexable

public boolean checkDocumentIndexable(java.io.File localFile)
                               throws ManifoldCFException,
                                      ServiceInterruption
Check whether a document is indexable by the currently specified output connector.

Specified by:
checkDocumentIndexable in interface IFingerprintActivity
Parameters:
localFile - is the local copy of the file to check.
Returns:
true if the document is indexable.
Throws:
ManifoldCFException
ServiceInterruption

createGlobalString

public java.lang.String createGlobalString(java.lang.String simpleString)
Create a global string from a simple string.

Specified by:
createGlobalString in interface INamingActivity
Parameters:
simpleString - is the simple string.
Returns:
a global string.

createConnectionSpecificString

public java.lang.String createConnectionSpecificString(java.lang.String simpleString)
Create a connection-specific string from a simple string.

Specified by:
createConnectionSpecificString in interface INamingActivity
Parameters:
simpleString - is the simple string.
Returns:
a connection-specific string.

createJobSpecificString

public java.lang.String createJobSpecificString(java.lang.String simpleString)
Create a job-specific string from a simple string.

Specified by:
createJobSpecificString in interface INamingActivity
Parameters:
simpleString - is the simple string.
Returns:
a job-specific string.
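The three INamingActivity methods differ in scope: the same simple string yields names that are shared globally, per connection, or per job. The format of the returned strings is not specified here, so the sketch below is purely illustrative; ScopedNameSketch, its prefixes, and its separator are all invented for the example and should not be taken as the actual ManifoldCF naming scheme.

```java
// Hypothetical illustration of name scoping; prefixes and separators are invented.
final class ScopedNameSketch {
    static String global(String simpleString) {
        return "global:" + simpleString;
    }

    static String connectionSpecific(String connectionName, String simpleString) {
        return "connection:" + connectionName + ":" + simpleString;
    }

    static String jobSpecific(long jobId, String simpleString) {
        return "job:" + jobId + ":" + simpleString;
    }
}
```

The point of scoping is that two jobs or connections can use the same simple string, for instance as an event name, without colliding.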