org.apache.manifoldcf.crawler.connectors.rss
Class ThrottledFetcher.Server

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher.Server
Enclosing class:
ThrottledFetcher

protected class ThrottledFetcher.Server
extends java.lang.Object

This class holds the throttling state kept for a single server.

To calculate the effective "burst" fetches per second and bytes per second, we need some idea of what the window is. For example, if the window is too long, a long hiatus from fetching could cause overuse of the server when fetching resumes.

One solution would be to keep a record of each individual fetch, and "expire" a fetch by discarding its old record. However, this consumes a great deal of memory for all but the smallest intervals.

Another, better, solution is to hook into the start and end of individual fetches. These will, presumably, occur at the fastest possible rate without long pauses spent doing something else. The only complication is that fetches may well overlap, so we need to "reference count" the fetches to know when to reset the counters.

For "fetches per second", we can simply make sure we "schedule" the next fetch at an appropriate time, rather than keep records around. The overall rate may therefore be somewhat less than the specified rate, but that's perfectly acceptable.

For the "maximum open connections" limit, the best thing would be to establish a separate MultiThreadedConnectionPool for each Server. Then, the limit would be automatic.

Some notes on the algorithms used to limit server bandwidth impact
==================================================================

In the single-connection case, the algorithm we'd want to use works like this. On the first chunk of a series, the total length of time and the number of bytes are recorded. Then, prior to each subsequent chunk, a calculation is done which attempts to hit the bandwidth target by the end of the chunk read, using the rate of the first chunk access to estimate how long it will take to fetch the next n bytes.

In the multi-connection case, which this is, it's harder to come up with a good maximum bandwidth estimate, and harder still to "hit the target", because simultaneous fetches will intrude. The strategy is therefore:

1) The first chunk of any series should proceed without interference from other connections to the same server. The goal here is to get a decent quality estimate without any possibility of overwhelming the server.
2) The bandwidth of the first chunk is treated as the "maximum bandwidth per connection". That is, if other connections are going on, we can presume that each connection will use at most the bandwidth that the first fetch took. Thus, by generating end-time estimates based on this number, we are actually being conservative and using less server bandwidth.
3) For chunks that have started but not finished, we keep track of their size and estimated elapsed time in order to schedule when new chunks from other connections can start.
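As a rough illustration of the single-connection calculation described above, the following sketch computes how long to pause before a chunk read. It is not the ManifoldCF implementation: the class BandwidthDelaySketch, its method names, and the exact formula are assumptions made for illustration. Only the ms/byte "inverse rate" convention is taken from the field descriptions on this page.

```java
/** Illustrative only; names and formula are assumptions, not ManifoldCF code. */
final class BandwidthDelaySketch {
    private BandwidthDelaySketch() {}

    /** Inverse rate (ms per byte) measured from the first chunk of a series. */
    static double inverseRate(long elapsedMs, int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("bytes must be positive");
        }
        return (double) elapsedMs / bytes;
    }

    /**
     * Milliseconds to wait before reading byteCount more bytes, so that the
     * read is estimated to finish no earlier than the bandwidth target allows.
     */
    static long delayBeforeRead(long seriesStartTime, long now, long totalBytesRead,
                                int byteCount, double rateEstimate,
                                double minimumMsPerByte) {
        // Earliest time at which (bytes so far + this chunk) stays within the limit.
        double targetEnd = seriesStartTime + (totalBytesRead + byteCount) * minimumMsPerByte;
        // When the read would finish if started now, based on the first-chunk rate.
        double estimatedEnd = now + byteCount * rateEstimate;
        long delay = (long) Math.ceil(targetEnd - estimatedEnd);
        return Math.max(0L, delay); // never wait a negative amount
    }
}
```

Because the first-chunk rate is treated as an upper bound on per-connection bandwidth, a delay computed this way errs on the side of using less server bandwidth than the limit permits.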


Field Summary
protected  boolean estimateInProgress
          Flag indicating whether rate estimation is currently in progress
protected  boolean estimateValid
          Flag indicating whether the current rate estimate is valid; when false, a new estimate is needed
protected  java.lang.Integer firstChunkLock
          This object is used to gate access while the first chunk is being read
protected  long nextFetchTime
          This is the time of the next allowed fetch (in ms since epoch)
protected  int outstandingConnections
          Outstanding connection counter
protected  double rateEstimate
          The inverse rate estimate of the first fetch, in ms/byte
protected  int refCount
          Reference count for bandwidth variables
protected  long seriesStartTime
          The start time of this series
protected  java.lang.String serverName
          The fully qualified domain name (FQDN) of the server
protected  long totalBytesRead
          Total actual bytes read in this series; this includes fetches in progress
 
Constructor Summary
ThrottledFetcher.Server(java.lang.String serverName)
          Constructor
 
Method Summary
 void beginFetch(long minimumMillisecondsPerFetchPerServer)
          Note the start of a fetch operation.
 void beginRead(int byteCount, double minimumMillisecondsPerBytePerServer)
          Note the start of an individual byte read of a specified size.
 void discard()
          Discard this server.
 void endFetch()
          Note the end of a fetch operation.
 void endRead(int originalCount, int actualCount)
          Note the end of an individual read from the server.
 java.lang.String getServerName()
          Get the fully qualified domain name (FQDN) of the server
 void registerConnection(int maxOutstandingConnections)
          Register an outstanding connection (and wait until it can be obtained before proceeding)
 void releaseConnection()
          Release an outstanding connection back into the pool
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serverName

protected java.lang.String serverName
The fully qualified domain name (FQDN) of the server


nextFetchTime

protected long nextFetchTime
This is the time of the next allowed fetch (in ms since epoch)


refCount

protected int refCount
Reference count for bandwidth variables


rateEstimate

protected double rateEstimate
The inverse rate estimate of the first fetch, in ms/byte


estimateValid

protected boolean estimateValid
Flag indicating whether the current rate estimate is valid; when false, a new estimate is needed


estimateInProgress

protected boolean estimateInProgress
Flag indicating whether rate estimation is currently in progress


seriesStartTime

protected long seriesStartTime
The start time of this series


totalBytesRead

protected long totalBytesRead
Total actual bytes read in this series; this includes fetches in progress


firstChunkLock

protected java.lang.Integer firstChunkLock
This object is used to gate access while the first chunk is being read


outstandingConnections

protected int outstandingConnections
Outstanding connection counter

Constructor Detail

ThrottledFetcher.Server

public ThrottledFetcher.Server(java.lang.String serverName)
Constructor

Method Detail

getServerName

public java.lang.String getServerName()
Get the fully qualified domain name (FQDN) of the server


registerConnection

public void registerConnection(int maxOutstandingConnections)
                        throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Register an outstanding connection (and wait until it can be obtained before proceeding)

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
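
The "wait until it can be obtained" behavior can be pictured as a counter guarded by a monitor. The sketch below is a minimal illustration under stated assumptions, not the actual method body: ConnectionGateSketch and its members are hypothetical names, and an unchecked IllegalStateException stands in for the ManifoldCFException the real method declares.

```java
/** Illustrative only; a hypothetical stand-in for the connection counter. */
final class ConnectionGateSketch {
    private int outstanding = 0;

    /** Blocks until fewer than maxOutstanding connections are in use, then takes one. */
    synchronized void register(int maxOutstanding) {
        if (maxOutstanding <= 0) {
            throw new IllegalArgumentException("maxOutstanding must be positive");
        }
        try {
            // Loop guards against spurious wakeups and races with other waiters.
            while (outstanding >= maxOutstanding) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
        outstanding++;
    }

    /** Returns a connection and wakes any threads waiting in register(). */
    synchronized void release() {
        if (outstanding == 0) {
            throw new IllegalStateException("no outstanding connection to release");
        }
        outstanding--;
        notifyAll();
    }

    synchronized int outstanding() {
        return outstanding;
    }
}
```

A caller would pair each register call with a release in a finally block so that a failed fetch cannot leak a connection slot.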

releaseConnection

public void releaseConnection()
Release an outstanding connection back into the pool


beginFetch

public void beginFetch(long minimumMillisecondsPerFetchPerServer)
                throws java.lang.InterruptedException
Note the start of a fetch operation. Call this method just before the actual stream access begins. May block until the schedule allows the fetch to start.

Throws:
java.lang.InterruptedException
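
One way to picture the scheduling behind this method is a single "next allowed fetch" timestamp that each caller advances, in the spirit of the nextFetchTime field. The following is a minimal sketch under that assumption; FetchScheduleSketch and reserveSlot are hypothetical names, the caller passes in the current time to keep the example deterministic, and the real method may differ in detail.

```java
/** Illustrative only; names are assumptions, not ManifoldCF code. */
final class FetchScheduleSketch {
    // Earliest time (ms since epoch) at which the next fetch may start.
    private long nextFetchTime = 0L;

    /**
     * Reserves the next fetch slot and returns how many milliseconds the
     * caller should wait before starting its fetch.
     */
    synchronized long reserveSlot(long now, long minimumMsPerFetch) {
        // Start immediately if the schedule is idle, otherwise at the reserved time.
        long start = Math.max(now, nextFetchTime);
        // Push the schedule forward so the following fetch is spaced out.
        nextFetchTime = start + minimumMsPerFetch;
        return start - now;
    }
}
```

Since slots are only ever pushed later and never reclaimed, the achieved rate can fall somewhat below the configured limit, which matches the trade-off described in the class overview.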

endFetch

public void endFetch()
Note the end of a fetch operation. Call this method just after the fetch completes.


beginRead

public void beginRead(int byteCount,
                      double minimumMillisecondsPerBytePerServer)
               throws java.lang.InterruptedException
Note the start of an individual byte read of a specified size. Call this method just before the read request takes place. Performs the necessary delay prior to reading the specified number of bytes from the server.

Throws:
java.lang.InterruptedException

endRead

public void endRead(int originalCount,
                    int actualCount)
Note the end of an individual read from the server. Call this just after an individual read completes. Pass the actual number of bytes read to the method.
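
The call order implied by the method descriptions above can be sketched as follows. This is an assumption-laden illustration rather than code from the connector: Throttle is a hypothetical interface that only mirrors the method names on this page (without their checked exceptions), Recording is a test double, and the literal limits passed in are arbitrary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; Throttle is a hypothetical stand-in, not ThrottledFetcher.Server. */
final class ThrottledReadSketch {
    interface Throttle {
        void registerConnection(int maxOutstandingConnections);
        void beginFetch(long minimumMsPerFetch);
        void beginRead(int byteCount, double minimumMsPerByte);
        void endRead(int originalCount, int actualCount);
        void endFetch();
        void releaseConnection();
    }

    /** Test double that records the order in which the hooks are called. */
    static final class Recording implements Throttle {
        final List<String> calls = new ArrayList<>();
        public void registerConnection(int max) { calls.add("registerConnection"); }
        public void beginFetch(long ms) { calls.add("beginFetch"); }
        public void beginRead(int n, double ms) { calls.add("beginRead"); }
        public void endRead(int requested, int actual) { calls.add("endRead"); }
        public void endFetch() { calls.add("endFetch"); }
        public void releaseConnection() { calls.add("releaseConnection"); }
    }

    /** Reads the whole stream in chunks, bracketing each step with the throttle hooks. */
    static long readAll(Throttle throttle, InputStream in, int chunkSize) {
        long total = 0L;
        throttle.registerConnection(2);           // arbitrary limit for the sketch
        try {
            throttle.beginFetch(100L);            // arbitrary spacing for the sketch
            try {
                byte[] buffer = new byte[chunkSize];
                while (true) {
                    throttle.beginRead(buffer.length, 0.0);
                    int n = 0;
                    try {
                        n = in.read(buffer, 0, buffer.length);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    } finally {
                        // Always report the read, counting end-of-stream or failure as 0 bytes.
                        throttle.endRead(buffer.length, Math.max(n, 0));
                    }
                    if (n < 0) {
                        break;
                    }
                    total += n;
                }
            } finally {
                throttle.endFetch();
            }
        } finally {
            throttle.releaseConnection();
        }
        return total;
    }
}
```

The nested try/finally blocks ensure that every begin call is matched by its end call even when a read fails, which matters because the reference counts and connection counter would otherwise drift.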


discard

public void discard()
Discard this server.