org.apache.manifoldcf.crawler.interfaces
Class QueueTracker

java.lang.Object
  extended by org.apache.manifoldcf.crawler.interfaces.QueueTracker

public class QueueTracker
extends java.lang.Object

This class attempts to assign document priorities so as to achieve as much balance as possible between documents having different bins. A document's priority is assigned at the time the document is added to the queue, and is recalculated when a job is aborted or when the crawler daemon is started. The document priorities are strictly obeyed when documents are chosen from the queue and handed to worker threads; higher-priority documents always have precedence, except where the job priority specifies a deliberate priority adjustment. The priority values themselves are logarithmic: 0.0 is the highest, and the larger the number, the lower the priority.

The basis for the calculation of each document priority handed out by this module is:
- number of documents having a given bin (tracked)
- performance of a connection (gathered through statistics)
- throttling that applies to each document bin

The queuing prioritization model hooks into the document lifecycle in the following places:
(1) When a document is added to the queue (and thus when its priority is handed out)
(2) When documents that were *supposed* to be added to the queue turned out to already be there and already have an established priority (in which case the priority that was handed out before is returned to the pool for reuse)
(3) When a document is pulled from the database queue (which sets the current highest priority level that should not be exceeded in step (1))

The assignment prioritization model is largely independent of the queuing prioritization model, and is used to select among documents that have been marked "active" as they are handed to worker threads. These events cause information to be logged:
(1) When a document is handed to a worker thread
(2) When the worker thread completes the document


Nested Class Summary
protected static class QueueTracker.BinCount
          This is the class which allows a mutable integer count value to be saved in the bincount table.
protected static class QueueTracker.DoubleBinCount
          This is the class which allows a mutable double count value to be saved in the bincount table.
protected static class QueueTracker.PriorityKey
          This is the key class for the availablePriorities table
protected static class QueueTracker.ThrottleLimits
          This class represents the throttle limits out of the connection specification
protected static class QueueTracker.ThrottleLimitSpec
          This is a class which describes an individual throttle limit, in fetches per millisecond.
 
Field Summary
static java.lang.String _rcsid
           
protected  java.util.HashMap activeBinCounts
          These are the bin counts for active threads
protected  java.util.HashMap availablePriorities
          This hash table is keyed by PriorityKey objects, and contains ArrayList objects containing Doubles, in sorted order.
protected  java.util.HashMap binCounts
          These are the bin counts for a prioritization pass.
protected  java.util.HashMap binDependencies
          This hash table is keyed by a String (which is the bin name), and contains a HashMap of PriorityKey objects containing that String as a bin
protected static double binReductionFactor
          Factor by which bins are reduced
protected  double currentMinimumDepth
          The "minimum depth" - which is the smallest bin count of the last document queued.
protected  PerformanceStatistics performanceStatistics
          These are the accumulated performance averages for all connections etc.
protected  java.util.HashMap queuedBinCounts
          These are the bin counts for tracking the documents that are on the active queue, but are not being processed yet
protected  boolean resetInProgress
          This flag, when set, indicates that a reset is in progress, so QueueTracker bin count updates are ignored.
 
Constructor Summary
QueueTracker()
          Constructor
 
Method Summary
 void addRecord(java.lang.String[] binNames)
          Add an access record to the queue tracker.
 void assessMinimumDepth(java.lang.Double[] binNamesSet)
          Assess the current minimum depth.
 void beginProcessing(java.lang.String[] binNames)
          Note that we are beginning processing for a document with a particular set of bins.
 void beginReset()
          Reset the queue tracker.
 double calculateAssignmentRating(java.lang.String[] binNames, IRepositoryConnection connection)
          Calculate an assignment rating for a set of bins based on what's currently in use.
protected  double[] calculateMaxFetchRates(java.lang.String[] binNames, IRepositoryConnection connection)
          Calculate the maximum fetch rate for a given set of bins for a given connection.
 double calculatePriority(java.lang.String[] binNames, IRepositoryConnection connection)
          Calculate a document priority value.
 void endProcessing(java.lang.String[] binNames)
          Note that we have completed processing of a document with a given set of bins.
 void endReset()
          Finish the reset operation
 PerformanceStatistics getCurrentStatistics()
          Obtain the current performance statistics object
 void noteConnectionPerformance(int docCount, java.lang.String connectionName, long elapsedTime)
          Note the time required to successfully complete a set of documents.
 void notePriorityNotUsed(java.lang.String[] binNames, IRepositoryConnection connection, double priority)
          Note that a priority which was previously allocated was not used, and needs to be released.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values

binReductionFactor

protected static final double binReductionFactor
Factor by which bins are reduced

See Also:
Constant Field Values

performanceStatistics

protected PerformanceStatistics performanceStatistics
These are the accumulated performance averages for all connections etc.


binCounts

protected java.util.HashMap binCounts
These are the bin counts for a prioritization pass. This hash table is keyed by bin, and contains DoubleBinCount objects as values


queuedBinCounts

protected java.util.HashMap queuedBinCounts
These are the bin counts for tracking the documents that are on the active queue, but are not being processed yet


activeBinCounts

protected java.util.HashMap activeBinCounts
These are the bin counts for active threads


currentMinimumDepth

protected double currentMinimumDepth
The "minimum depth" - which is the smallest bin count of the last document queued. This helps guarantee that newly discovered documents don't wind up with high priority, but instead wind up with about the same priority as the currently active documents.


resetInProgress

protected boolean resetInProgress
This flag, when set, indicates that a reset is in progress, so QueueTracker bin count updates are ignored.


availablePriorities

protected java.util.HashMap availablePriorities
This hash table is keyed by PriorityKey objects, and contains ArrayList objects containing Doubles, in sorted order.


binDependencies

protected java.util.HashMap binDependencies
This hash table is keyed by a String (which is the bin name), and contains a HashMap of PriorityKey objects containing that String as a bin

Constructor Detail

QueueTracker

public QueueTracker()
Constructor

Method Detail

beginReset

public void beginReset()
Reset the queue tracker. This occurs ONLY when we are about to reprioritize all active documents. It does not affect the portion of the queue tracker that tracks the active queue.


endReset

public void endReset()
Finish the reset operation


addRecord

public void addRecord(java.lang.String[] binNames)
Add an access record to the queue tracker. This happens when a document is added to the in-memory queue, and allows us to keep track of that particular event so we can schedule in a way that meets our distribution goals.

Parameters:
binNames - is the set of bins, as returned from the connector in question, for the document that is being queued. These bins are considered global in nature.

notePriorityNotUsed

public void notePriorityNotUsed(java.lang.String[] binNames,
                                IRepositoryConnection connection,
                                double priority)
Note that a priority which was previously allocated was not used, and needs to be released.
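
The release-and-reuse idea can be illustrated with a minimal sketch. It is not the ManifoldCF implementation: the class PriorityPoolSketch, its String key (standing in for a PriorityKey), and the choice to reuse the numerically smallest pooled value first are all assumptions made for illustration. The only detail taken from this page is that the pool holds sorted lists of Doubles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and behavior are hypothetical. */
class PriorityPoolSketch {
    // Hypothetical stand-in for availablePriorities: key -> ascending list of released priorities.
    private final Map<String, List<Double>> available = new HashMap<>();

    /** Returns an unused priority to the pool, keeping the list sorted ascending. */
    synchronized void release(String key, double priority) {
        List<Double> list = available.computeIfAbsent(key, k -> new ArrayList<>());
        int pos = Collections.binarySearch(list, Double.valueOf(priority));
        if (pos < 0) {
            pos = -pos - 1; // binarySearch encodes the insertion point as -(point) - 1
        }
        list.add(pos, priority);
    }

    /** Takes the numerically smallest pooled priority, or null if none is pooled (assumed policy). */
    synchronized Double reuse(String key) {
        List<Double> list = available.get(key);
        if (list == null || list.isEmpty()) {
            return null;
        }
        return list.remove(0);
    }

    public static void main(String[] args) {
        PriorityPoolSketch pool = new PriorityPoolSketch();
        pool.release("example.com", 2.5);
        pool.release("example.com", 0.5);
        pool.release("example.com", 1.5);
        System.out.println(pool.reuse("example.com")); // prints 0.5
    }
}
```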


noteConnectionPerformance

public void noteConnectionPerformance(int docCount,
                                      java.lang.String connectionName,
                                      long elapsedTime)
Note the time required to successfully complete a set of documents. This allows this module to keep track of the performance characteristics of each individual connection, so distribution across connections can be balanced properly.
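
As a rough illustration of what tracking per-connection performance might involve, the sketch below accumulates document counts and elapsed time per connection name and reports an average. The class ConnectionPerformanceSketch and its methods are hypothetical, the unit of elapsedTime is assumed to be milliseconds (this page does not state it), and the real PerformanceStatistics class may average differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not the PerformanceStatistics implementation. */
class ConnectionPerformanceSketch {
    /** Running totals for one connection. */
    static final class Totals {
        long docs;
        long elapsedMillis; // assumption: elapsed time is measured in milliseconds
    }

    private final Map<String, Totals> totals = new HashMap<>();

    /** Records that docCount documents completed in elapsedTime on the named connection. */
    synchronized void note(int docCount, String connectionName, long elapsedTime) {
        Totals t = totals.computeIfAbsent(connectionName, k -> new Totals());
        t.docs += docCount;
        t.elapsedMillis += elapsedTime;
    }

    /** Average time per document, or NaN if nothing has been recorded for the connection. */
    synchronized double averageMillisPerDocument(String connectionName) {
        Totals t = totals.get(connectionName);
        if (t == null || t.docs == 0) {
            return Double.NaN;
        }
        return (double) t.elapsedMillis / t.docs;
    }

    public static void main(String[] args) {
        ConnectionPerformanceSketch stats = new ConnectionPerformanceSketch();
        stats.note(10, "web", 2000L);
        stats.note(30, "web", 2000L);
        System.out.println(stats.averageMillisPerDocument("web")); // prints 100.0
    }
}
```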


getCurrentStatistics

public PerformanceStatistics getCurrentStatistics()
Obtain the current performance statistics object


beginProcessing

public void beginProcessing(java.lang.String[] binNames)
Note that we are beginning processing for a document with a particular set of bins. This method is called when a worker thread starts work on a set of documents.


assessMinimumDepth

public void assessMinimumDepth(java.lang.Double[] binNamesSet)
Assess the current minimum depth. This method is called to give the QueueTracker information about the priorities of the documents currently being queued. It is suboptimal to assign document priorities that are fundamentally higher than this value, because the new documents would then be preferentially queued, and the goal of distributing documents across bins would not be adequately met.

Parameters:
binNamesSet - is the current set of priorities we see on the queuing operation.
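
One way to picture the role of the minimum depth is as a floor on newly computed priority values. The sketch below is a simplification under stated assumptions, not the actual logic: the class MinimumDepthSketch and its clamp method are hypothetical, and the real method may update the depth under different conditions. Because larger values mean lower priority, taking the maximum keeps a new document from ranking ahead of what is currently being queued.

```java
/** Illustrative sketch only; the update rule here is an assumption. */
class MinimumDepthSketch {
    private double currentMinimumDepth = 0.0;

    /** Records the smallest (best) priority value seen among the documents being queued. */
    synchronized void assessMinimumDepth(Double[] priorities) {
        double smallest = Double.MAX_VALUE;
        for (Double p : priorities) {
            if (p != null && p < smallest) {
                smallest = p;
            }
        }
        if (smallest != Double.MAX_VALUE) {
            currentMinimumDepth = smallest; // simplification: always adopt the latest value
        }
    }

    /** Hypothetical helper: never hand out a value better (smaller) than the minimum depth. */
    synchronized double clamp(double computedPriority) {
        return Math.max(computedPriority, currentMinimumDepth);
    }

    public static void main(String[] args) {
        MinimumDepthSketch sketch = new MinimumDepthSketch();
        sketch.assessMinimumDepth(new Double[] {3.0, 1.5, 2.0});
        System.out.println(sketch.clamp(0.2)); // prints 1.5
        System.out.println(sketch.clamp(4.0)); // prints 4.0
    }
}
```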

endProcessing

public void endProcessing(java.lang.String[] binNames)
Note that we have completed processing of a document with a given set of bins. This method is called when a worker thread has finished with a document.
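
The pairing of beginProcessing and endProcessing amounts to per-bin reference counting of active worker threads. A minimal sketch follows, assuming a map of mutable counters in the spirit of QueueTracker.BinCount. The class ActiveBinCountsSketch, its Count type, and the activeCount accessor are hypothetical, and silently ignoring an unmatched endProcessing call is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not the QueueTracker implementation. */
class ActiveBinCountsSketch {
    /** Mutable counter, so the map entry can be updated in place. */
    static final class Count {
        int value;
    }

    private final Map<String, Count> active = new HashMap<>();

    /** A worker thread has started on a document belonging to these bins. */
    synchronized void beginProcessing(String[] binNames) {
        for (String bin : binNames) {
            active.computeIfAbsent(bin, k -> new Count()).value++;
        }
    }

    /** A worker thread has finished with a document belonging to these bins. */
    synchronized void endProcessing(String[] binNames) {
        for (String bin : binNames) {
            Count c = active.get(bin);
            if (c == null) {
                continue; // assumption: ignore an end without a matching begin
            }
            if (--c.value <= 0) {
                active.remove(bin); // drop empty counters so the map stays small
            }
        }
    }

    /** Number of worker threads currently active on the given bin. */
    synchronized int activeCount(String bin) {
        Count c = active.get(bin);
        return c == null ? 0 : c.value;
    }

    public static void main(String[] args) {
        ActiveBinCountsSketch tracker = new ActiveBinCountsSketch();
        String[] bins = {"example.com"};
        tracker.beginProcessing(bins);
        tracker.beginProcessing(bins);
        tracker.endProcessing(bins);
        System.out.println(tracker.activeCount("example.com")); // prints 1
    }
}
```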


calculateAssignmentRating

public double calculateAssignmentRating(java.lang.String[] binNames,
                                        IRepositoryConnection connection)
Calculate an assignment rating for a set of bins based on what's currently in use. This rating is used to help determine which documents returned from a queuing query actually get made "active", and which ones are skipped for the moment. The rating for each bin is 1 divided by (1 plus the active thread count for that bin). The higher the rating, the better. The per-bin ratings are combined by multiplying them together and then taking the nth root (where n is the number of bins) to normalize for the number of bins; in other words, the combined rating is their geometric mean. The repository connection is used to reduce the priority of assignment, based on the fetch rate that will result from this set of bins.
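
The per-bin formula and the nth-root combination described above can be sketched as follows. This is an illustration, not the actual method: the class AssignmentRatingSketch and the activeCounts map (standing in for activeBinCounts) are hypothetical, the return value for an empty bin set is an assumption, and the connection-based fetch-rate reduction is omitted entirely.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; omits the fetch-rate adjustment based on the connection. */
class AssignmentRatingSketch {
    /**
     * Combines per-bin ratings of 1 / (1 + activeCount) by geometric mean.
     * activeCounts is a hypothetical stand-in for the activeBinCounts table.
     */
    static double rate(String[] binNames, Map<String, Integer> activeCounts) {
        if (binNames.length == 0) {
            return 1.0; // assumption: a document with no bins is unconstrained
        }
        double product = 1.0;
        for (String bin : binNames) {
            int activeThreads = activeCounts.getOrDefault(bin, 0);
            product *= 1.0 / (1.0 + activeThreads);
        }
        // nth root normalizes for the number of bins
        return Math.pow(product, 1.0 / binNames.length);
    }

    public static void main(String[] args) {
        Map<String, Integer> active = new HashMap<>();
        active.put("example.com", 3); // three worker threads busy on this bin
        System.out.println(rate(new String[] {"example.org"}, active));
        System.out.println(rate(new String[] {"example.com"}, active));
        System.out.println(rate(new String[] {"example.com", "example.org"}, active));
    }
}
```

In this example the idle bin rates 1, the bin with three active threads rates 1/4, and a document in both bins rates the square root of 1/4, which is 1/2. A document touching a busy bin is therefore rated lower, but less severely than if the per-bin ratings were simply multiplied.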


calculatePriority

public double calculatePriority(java.lang.String[] binNames,
                                IRepositoryConnection connection)
Calculate a document priority value. Priorities are reversed, and in log space, so that zero (0.0) is considered the highest possible priority, and larger priority values are considered lower in actual priority.

Parameters:
binNames - are the global bins to which the document belongs.
connection - is the connection, from which the throttles may be obtained. More highly throttled connections are given less favorable priority.
Returns:
the priority value, based on recent history. Also updates statistics atomically.

calculateMaxFetchRates

protected double[] calculateMaxFetchRates(java.lang.String[] binNames,
                                          IRepositoryConnection connection)
Calculate the maximum fetch rate for a given set of bins for a given connection. This is used to adjust the final priority of a document.