org.apache.manifoldcf.crawler.connectors.webcrawler
Class ThrottledFetcher.ThrottleBin

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher.ThrottleBin
Enclosing class:
ThrottledFetcher

protected static class ThrottledFetcher.ThrottleBin
extends java.lang.Object

Throttles for a bin. An instance of this class keeps track of the information needed to bandwidth-throttle access to a URL belonging to a specific bin.

In order to calculate the effective "burst" fetches per second and bytes per second, we need to have some idea what the window is. For example, a long hiatus from fetching could cause overuse of the server when fetching resumes, if the window length is too long.

One solution to this problem would be to keep a list of the individual fetches as records. Then, we could "expire" a fetch by discarding the old record. However, this consumes a lot of memory for all but the smallest intervals.

Another, better, solution is to hook into the start and end of individual fetches. These will, presumably, occur at the fastest possible rate without long pauses spent doing something else. The only complication is that fetches may well overlap, so we need to "reference count" the fetches to know when to reset the counters.

For "fetches per second", we can simply make sure we "schedule" the next fetch at an appropriate time, rather than keep records around. The overall rate may therefore be somewhat less than the specified rate, but that's perfectly acceptable.

Some notes on the algorithms used to limit server bandwidth impact
==================================================================

In a single-connection case, the algorithm we'd want to use works like this. On the first chunk of a series, the total length of time and the number of bytes are recorded. Then, prior to each subsequent chunk, a calculation is done which attempts to hit the bandwidth target by the end of the chunk read, using the rate of the first chunk access as a way of estimating how long it will take to fetch those next n bytes.

For a multi-connection case, which this is, it's harder to come up with a good maximum bandwidth estimate, and harder still to "hit the target", because simultaneous fetches will intrude.

The strategy is therefore:

1) The first chunk of any series should proceed without interference from other connections to the same server. The goal here is to get a decent quality estimate without any possibility of overwhelming the server.
2) The bandwidth of the first chunk is treated as the "maximum bandwidth per connection". That is, if other connections are going on, we can presume that each connection will use at most the bandwidth that the first fetch took. Thus, by generating end-time estimates based on this number, we are actually being conservative and using less server bandwidth.
3) For chunks that have started but not finished, we keep track of their size and estimated elapsed time in order to schedule when new chunks from other connections can start.
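
The single-connection calculation described above can be illustrated with a small sketch. This is not the ManifoldCF implementation. The class ChunkDelaySketch and its methods are hypothetical names introduced only for illustration, and the sketch assumes that the target is expressed as a minimum number of milliseconds per byte (as in the beginRead parameter) and that the first-chunk rate is kept in ms/byte (as in the rateEstimate field). It ignores overlapping connections entirely.

```java
/** Hypothetical illustration only; not part of ManifoldCF. */
final class ChunkDelaySketch {

    /** Inverse rate of the first chunk, in ms/byte. */
    static double estimateRate(long elapsedMillis, int bytesRead) {
        if (bytesRead <= 0) {
            throw new IllegalArgumentException("bytesRead must be positive");
        }
        return (double) elapsedMillis / (double) bytesRead;
    }

    /**
     * Milliseconds to wait before reading the next chunk so that, if the
     * chunk arrives at the estimated rate, the series as a whole finishes
     * no sooner than the target allows.
     */
    static long delayBeforeChunk(long seriesStartTime,
                                 long now,
                                 long totalBytesRead,
                                 int byteCount,
                                 double rateEstimate,
                                 double minimumMillisecondsPerByte) {
        // Earliest time at which (bytes so far + this chunk) may be complete.
        double targetEnd = seriesStartTime
            + (totalBytesRead + byteCount) * minimumMillisecondsPerByte;
        // When the chunk would finish if it started right now.
        double estimatedEnd = now + byteCount * rateEstimate;
        long delay = (long) Math.ceil(targetEnd - estimatedEnd);
        // Never return a negative delay when already behind the target.
        return Math.max(0L, delay);
    }
}
```

For instance, under these assumptions, a first chunk of 1000 bytes read in 500 ms gives an estimate of 0.5 ms/byte. With a target of 2.0 ms/byte, a second 1000-byte chunk requested at 500 ms into the series would be delayed by 3000 ms, so that 2000 bytes complete at roughly the 4000 ms mark.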


Field Summary
protected  java.lang.String binName
          This is the bin name which this throttle belongs to.
protected  boolean estimateInProgress
          Flag indicating whether rate estimation is currently in progress
protected  boolean estimateValid
          Flag indicating whether the rate estimate is valid (if not, a rate estimate is needed)
protected  java.lang.Integer firstChunkLock
          This object is used to gate access while the first chunk is being read
protected  double rateEstimate
          The inverse rate estimate of the first fetch, in ms/byte
protected  int refCount
          This is the reference count for this bin (which records active references)
protected  long seriesStartTime
          The start time of this series
protected  long totalBytesRead
          Total actual bytes read in this series; this includes fetches in progress
 
Constructor Summary
ThrottledFetcher.ThrottleBin(java.lang.String binName)
          Constructor.
 
Method Summary
 void beginFetch()
          Note the start of a fetch operation for a bin.
 void beginRead(int byteCount, double minimumMillisecondsPerBytePerServer)
          Note the start of an individual byte read of a specified size.
 boolean endFetch()
          Note the end of a fetch operation.
 void endRead(int originalCount, int actualCount)
          Note the end of an individual read from the server.
 java.lang.String getBinName()
          Get the bin name.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

binName

protected java.lang.String binName
This is the bin name which this throttle belongs to.


refCount

protected int refCount
This is the reference count for this bin (which records active references)
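
As a rough illustration of why overlapping fetches need a reference count, the sketch below resets the per-series counters only when a fetch begins while no other fetch is active. It is a minimal sketch, not the actual class. SeriesRefCountSketch and all of its members are hypothetical, and the meaning given to the boolean returned by endFetch here (true when no fetches remain active) is an assumption, since the documentation does not state it.

```java
/** Hypothetical illustration only; not part of ManifoldCF. */
final class SeriesRefCountSketch {
    private int refCount = 0;
    private long seriesStartTime = -1L;
    private long totalBytesRead = 0L;

    /** Start a fetch; a new series begins only if nothing else is active. */
    synchronized void beginFetch(long now) {
        if (refCount == 0) {
            seriesStartTime = now;
            totalBytesRead = 0L;
        }
        refCount++;
    }

    /** Add bytes read by any active fetch to the series total. */
    synchronized void recordBytes(int count) {
        totalBytesRead += count;
    }

    /** End a fetch; returns true when no fetches remain (assumed meaning). */
    synchronized boolean endFetch() {
        if (refCount == 0) {
            throw new IllegalStateException("endFetch without beginFetch");
        }
        refCount--;
        return refCount == 0;
    }

    synchronized int activeFetches() { return refCount; }
    synchronized long seriesStartTime() { return seriesStartTime; }
    synchronized long totalBytesRead() { return totalBytesRead; }
}
```

With this arrangement, two fetches that overlap share one series start time and one byte total, and the counters start fresh only after both have ended and a later fetch begins.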


rateEstimate

protected double rateEstimate
The inverse rate estimate of the first fetch, in ms/byte


estimateValid

protected boolean estimateValid
Flag indicating whether the rate estimate is valid (if not, a rate estimate is needed)


estimateInProgress

protected boolean estimateInProgress
Flag indicating whether rate estimation is currently in progress


seriesStartTime

protected long seriesStartTime
The start time of this series


totalBytesRead

protected long totalBytesRead
Total actual bytes read in this series; this includes fetches in progress


firstChunkLock

protected java.lang.Integer firstChunkLock
This object is used to gate access while the first chunk is being read

Constructor Detail

ThrottledFetcher.ThrottleBin

public ThrottledFetcher.ThrottleBin(java.lang.String binName)
Constructor.

Method Detail

getBinName

public java.lang.String getBinName()
Get the bin name.


beginFetch

public void beginFetch()
                throws java.lang.InterruptedException
Note the start of a fetch operation for a bin. Call this method just before the actual stream access begins. May wait until the schedule allows the fetch to proceed.

Throws:
java.lang.InterruptedException

beginRead

public void beginRead(int byteCount,
                      double minimumMillisecondsPerBytePerServer)
               throws java.lang.InterruptedException
Note the start of an individual byte read of a specified size. Call this method just before the read request takes place. Performs the necessary delay prior to reading the specified number of bytes from the server.

Throws:
java.lang.InterruptedException

endRead

public void endRead(int originalCount,
                    int actualCount)
Note the end of an individual read from the server. Call this just after an individual read completes. Pass the actual number of bytes read to the method.


endFetch

public boolean endFetch()
Note the end of a fetch operation. Call this method just after the fetch completes.
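
The call order described by the four methods above might look like the following sketch. Because ThrottleBin is a protected nested class, the example uses a hypothetical interface, BinThrottle, that mirrors the documented signatures, together with a stand-in, RecordingThrottle, that only records calls and performs no throttling. FetchLifecycleSketch is likewise hypothetical. Passing 0 as the actual count at end of stream is an assumption, not something the documentation specifies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical interface mirroring the documented method signatures. */
interface BinThrottle {
    void beginFetch() throws InterruptedException;
    void beginRead(int byteCount, double minimumMillisecondsPerBytePerServer)
        throws InterruptedException;
    void endRead(int originalCount, int actualCount);
    boolean endFetch();
}

/** Stand-in that records calls only; it does not throttle anything. */
final class RecordingThrottle implements BinThrottle {
    final List<String> calls = new ArrayList<>();
    public void beginFetch() { calls.add("beginFetch"); }
    public void beginRead(int byteCount, double minMsPerByte) {
        calls.add("beginRead(" + byteCount + ")");
    }
    public void endRead(int originalCount, int actualCount) {
        calls.add("endRead(" + originalCount + "," + actualCount + ")");
    }
    public boolean endFetch() { calls.add("endFetch"); return true; }
}

final class FetchLifecycleSketch {
    /** Reads the whole stream in chunks, bracketing each step as documented. */
    static long fetch(BinThrottle bin, InputStream in, int chunkSize,
                      double minMsPerByte)
            throws IOException, InterruptedException {
        byte[] buffer = new byte[chunkSize];
        long total = 0L;
        bin.beginFetch();                      // just before stream access begins
        try {
            while (true) {
                bin.beginRead(buffer.length, minMsPerByte); // just before the read
                int actual = -1;
                try {
                    actual = in.read(buffer, 0, buffer.length);
                } finally {
                    // Assumption: report 0 bytes on end of stream or failure.
                    bin.endRead(buffer.length, Math.max(actual, 0));
                }
                if (actual < 0) {
                    break;
                }
                total += actual;
            }
        } finally {
            bin.endFetch();                    // just after the fetch completes
        }
        return total;
    }
}
```

The try/finally blocks are a design choice in this sketch: they keep every beginRead paired with an endRead and every beginFetch paired with an endFetch even when the read throws.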