org.apache.manifoldcf.crawler.connectors.rss
Class ThrottledFetcher

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher

public class ThrottledFetcher
extends java.lang.Object

This class uses httpclient to fetch content from web servers while controlling the fetch rate in two ways: it caps the overall bandwidth used per server, and it limits the number of simultaneously open connections per server. It can also limit the number of fetches per time period per server; that functionality is not strictly necessary at this time, because the ManifoldCF scheduler already does that at a higher layer. Because these limits apply over long periods, an instance of this class should be long-lived, and most likely static. The class sets up a separate HTTP connection pool for each server, so that the httpclient library takes on the task of limiting the number of connections. As a consequence, periodic polling is needed to determine when idle pooled connections can be freed.


Nested Class Summary
protected static class ThrottledFetcher.DataRecorder
          This class takes care of recording data and results for posterity
protected static class ThrottledFetcher.DataSession
          Helper class for ThrottledFetcher.DataRecorder
protected  class ThrottledFetcher.Server
          This class represents the throttling state kept for a single server.
protected static class ThrottledFetcher.ThrottledConnection
          This class represents an established connection to a URL.
protected static class ThrottledFetcher.ThrottledInputstream
          This class throttles an input stream based on the specified byte rate parameters.
 
Field Summary
static java.lang.String _rcsid
           
protected static java.lang.String dataFileFolder
           
protected static ThrottledFetcher.DataRecorder dataRecorder
           
protected static int globalHandleCount
          This counter tracks the total number of outstanding handles across all servers and fetcher instances, because that total is also limited
protected static java.lang.Integer globalHandleCounterLock
          This is the lock object that guards globalHandleCount
protected static int READ_CHUNK_LENGTH
          The read chunk length
protected static boolean recordEverything
          This flag determines whether we record everything to the disk, as a means of doing a web snapshot
protected  int refCount
          Reference count of the connections currently using this pool
protected static java.lang.String resultLogFile
           
protected  java.util.Map serverMap
          This map associates the server name (without port) with a Server object, which tracks statistics and enforces throttling for that server
 
Constructor Summary
ThrottledFetcher()
          Constructor.
 
Method Summary
 IThrottledConnection createConnection(java.lang.String serverName, double minimumMillisecondsPerBytePerServer, int maxOpenConnectionsPerServer, long minimumMillisecondsPerFetchPerServer, int connectionLimit, int connectionTimeoutMilliseconds)
          Establish a connection to the specified server.
 void noteConnectionEstablished()
          Note that there is a repository connection that is using this object.
 void noteConnectionReleased()
          Connection pool no longer needed.
 void poll()
          Poll.
protected static void registerGlobalHandle(int maxHandles)
          Note that we're about to need a handle (and make sure we have enough)
protected static void releaseGlobalHandle()
          Note that we're done with a handle (so we can free it)
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values

recordEverything

protected static final boolean recordEverything
This flag determines whether we record everything to the disk, as a means of doing a web snapshot

See Also:
Constant Field Values

READ_CHUNK_LENGTH

protected static final int READ_CHUNK_LENGTH
The read chunk length

See Also:
Constant Field Values

globalHandleCount

protected static int globalHandleCount
This counter tracks the total number of outstanding handles across all servers and fetcher instances, because that total is also limited


globalHandleCounterLock

protected static java.lang.Integer globalHandleCounterLock
This is the lock object that guards globalHandleCount


serverMap

protected java.util.Map serverMap
This map associates the server name (without port) with a Server object, which tracks statistics and enforces throttling for that server
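
As a rough illustration of a map keyed by server name without the port, the sketch below strips any port suffix and creates per-server state on first use. The names ServerRegistry, ServerState and keyFor are hypothetical and are not part of ThrottledFetcher. Lower-casing the key and the handling of the colon are assumptions; the sketch expects a plain host name, not an IPv6 literal.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the ManifoldCF implementation.
final class ServerRegistry {
    /** Minimal stand-in for the per-server throttling state. */
    static final class ServerState {
        final String name;
        int openConnections; // illustrative statistic only

        ServerState(String name) {
            this.name = name;
        }
    }

    private final Map<String, ServerState> serverMap = new HashMap<>();

    /** Normalizes "host:port" to a lower-case host key with the port removed (assumption). */
    static String keyFor(String hostAndPort) {
        int colon = hostAndPort.lastIndexOf(':');
        String host = colon >= 0 ? hostAndPort.substring(0, colon) : hostAndPort;
        return host.toLowerCase(Locale.ROOT);
    }

    /** Returns the state for a server, creating it on first use. */
    synchronized ServerState lookup(String hostAndPort) {
        return serverMap.computeIfAbsent(keyFor(hostAndPort), ServerState::new);
    }

    /** Number of distinct servers seen so far. */
    synchronized int size() {
        return serverMap.size();
    }
}
```

With this keying, requests to the same host on different ports share one throttling record, which matches the "without port" wording above.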


refCount

protected int refCount
Reference count of the connections currently using this pool


resultLogFile

protected static final java.lang.String resultLogFile
See Also:
Constant Field Values

dataFileFolder

protected static final java.lang.String dataFileFolder
See Also:
Constant Field Values

dataRecorder

protected static ThrottledFetcher.DataRecorder dataRecorder

Constructor Detail

ThrottledFetcher

public ThrottledFetcher()
Constructor.

Method Detail

registerGlobalHandle

protected static void registerGlobalHandle(int maxHandles)
                                    throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note that we're about to need a handle (and make sure we have enough)

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

releaseGlobalHandle

protected static void releaseGlobalHandle()
Note that we're done with a handle (so we can free it)
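
The register/release pair can be pictured as a bounded counter guarded by a lock. The following is a minimal sketch of that pattern, not the actual ManifoldCF code: the class HandleCounter and its methods are hypothetical, the blocking behavior is an assumption, and interruption is reported with an unchecked exception here where the real method declares ManifoldCFException. The sketch also uses a plain Object as its lock where the class uses an Integer field.

```java
// Hypothetical sketch; names and blocking behavior are assumptions.
final class HandleCounter {
    private final Object lock = new Object(); // dedicated lock object
    private int outstanding = 0;

    /** Blocks until fewer than maxHandles are outstanding, then claims one. */
    void register(int maxHandles) {
        synchronized (lock) {
            try {
                while (outstanding >= maxHandles) {
                    lock.wait(); // woken by release()
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                throw new IllegalStateException("interrupted while waiting for a handle", e);
            }
            outstanding++;
        }
    }

    /** Gives a handle back and wakes any threads waiting in register(). */
    void release() {
        synchronized (lock) {
            if (outstanding == 0) {
                throw new IllegalStateException("no handle to release");
            }
            outstanding--;
            lock.notifyAll();
        }
    }

    /** Current number of outstanding handles. */
    int count() {
        synchronized (lock) {
            return outstanding;
        }
    }
}
```

Every successful register call should be matched by exactly one release call, typically in a finally block, so that waiting threads are not starved.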


createConnection

public IThrottledConnection createConnection(java.lang.String serverName,
                                             double minimumMillisecondsPerBytePerServer,
                                             int maxOpenConnectionsPerServer,
                                             long minimumMillisecondsPerFetchPerServer,
                                             int connectionLimit,
                                             int connectionTimeoutMilliseconds)
                                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                             org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Establish a connection to the specified server.

Parameters:
serverName - is the FQDN of the server, e.g. foo.metacarta.com
minimumMillisecondsPerBytePerServer - is the minimum number of milliseconds to allow per byte, averaged over all streams reading from this server. A stream will block on fetch until the number of bytes being fetched, taken over the average time interval required for that fetch, would not exceed the desired bandwidth.
maxOpenConnectionsPerServer - is the maximum number of open connections to allow for a single server. If more than this number of connections would need to be open, this connection request will block until the limit would no longer be exceeded.
minimumMillisecondsPerFetchPerServer - is the minimum number of milliseconds between fetches, on a per-server basis. Set to zero for no limit.
connectionLimit - is the maximum desired number of outstanding connections at any one time.
connectionTimeoutMilliseconds - is the number of milliseconds to wait for the connection before timing out.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
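
To make the milliseconds-per-byte parameter more concrete, the sketch below shows the arithmetic that relates it to a bandwidth cap and to the delay a reader would still owe. ThrottleMath and both of its methods are hypothetical helpers written for this illustration; they are not part of ThrottledFetcher, and the real class may account for time differently.

```java
// Hypothetical helper; illustrates the arithmetic only.
final class ThrottleMath {
    private ThrottleMath() {
    }

    /** Converts a bandwidth cap in bytes per second to milliseconds per byte. */
    static double millisecondsPerByte(double bytesPerSecond) {
        if (bytesPerSecond <= 0.0) {
            throw new IllegalArgumentException("bytesPerSecond must be positive");
        }
        return 1000.0 / bytesPerSecond;
    }

    /**
     * Returns how many more milliseconds must pass before byteCount bytes,
     * fetched since the start of the measurement window, fit the budget.
     */
    static long remainingDelayMillis(long byteCount, double minMillisPerByte, long elapsedMillis) {
        long requiredMillis = (long) Math.ceil(byteCount * minMillisPerByte);
        return Math.max(0L, requiredMillis - elapsedMillis);
    }
}
```

For example, a cap of 2000 bytes per second corresponds to 0.5 milliseconds per byte, so 65536 bytes need at least 32768 milliseconds; a reader that has used only 10000 milliseconds so far would wait a further 22768.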

poll

public void poll()
          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Poll. This method is designed to allow idle connections to be closed and freed.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
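
One way to picture what periodic polling achieves is a tracker that forgets entries that have been idle for too long. This is only a sketch under assumptions: IdleTracker, touch and the idle timeout are hypothetical, time is passed in explicitly to keep the example deterministic, and the real poll() presumably works through the per-server httpclient pools.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of idle cleanup driven by periodic polling.
final class IdleTracker {
    private final Map<String, Long> lastUsedMillis = new HashMap<>();
    private final long idleTimeoutMillis;

    IdleTracker(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Records that a pooled connection to the server was used at the given time. */
    synchronized void touch(String server, long nowMillis) {
        lastUsedMillis.put(server, nowMillis);
    }

    /** Drops every entry idle for at least the timeout; returns how many were freed. */
    synchronized int poll(long nowMillis) {
        int freed = 0;
        Iterator<Map.Entry<String, Long>> it = lastUsedMillis.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= idleTimeoutMillis) {
                it.remove(); // a real pool would close the connection here
                freed++;
            }
        }
        return freed;
    }

    /** Number of entries still being tracked. */
    synchronized int tracked() {
        return lastUsedMillis.size();
    }
}
```

Nothing is freed unless poll is called, which is why the owner of such an object has to call it on a regular schedule.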

noteConnectionEstablished

public void noteConnectionEstablished()
Note that there is a repository connection that is using this object.


noteConnectionReleased

public void noteConnectionReleased()
Connection pool no longer needed. Call this to indicate that this object no longer needs to keep its pools available, for the moment.
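
The established/released pair reads like a reference count over the pools. The sketch below shows that pattern in isolation, assuming that pools are kept while at least one repository connection uses the object and may be discarded when the count returns to zero. RefCountedPools and arePoolsOpen are hypothetical names, and the real class may do more work at each transition.

```java
// Hypothetical sketch of the reference-counting pattern.
final class RefCountedPools {
    private int refCount = 0;
    private boolean poolsOpen = false;

    /** A repository connection has started using this object. */
    synchronized void noteConnectionEstablished() {
        if (refCount == 0) {
            poolsOpen = true; // first user: pools are needed
        }
        refCount++;
    }

    /** A repository connection no longer needs this object. */
    synchronized void noteConnectionReleased() {
        if (refCount == 0) {
            throw new IllegalStateException("release without matching establish");
        }
        refCount--;
        if (refCount == 0) {
            poolsOpen = false; // last user gone: pools may be discarded
        }
    }

    /** Whether any user still requires the pools. */
    synchronized boolean arePoolsOpen() {
        return poolsOpen;
    }
}
```

Callers would pair each noteConnectionEstablished call with exactly one noteConnectionReleased call, so the count returns to zero only when the last user has finished.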