org.apache.manifoldcf.crawler.connectors.webcrawler
Class ThrottledFetcher

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher

public class ThrottledFetcher
extends java.lang.Object

This class uses httpclient to fetch content from web servers, while controlling the fetch rate in two ways: first, by limiting the overall bandwidth used per server, and second, by limiting the number of simultaneous open connections per server. An instance of this class would most likely need a lifetime that matches the long-term nature of these limits, and would therefore be static.
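
The two controls described above can be illustrated with a minimal, self-contained sketch. This is not the ThrottledFetcher implementation: the class PerServerThrottleSketch and the methods tryOpen, close, and earliestMillisFor are hypothetical names chosen for illustration, and the sketch assumes a plain counting semaphore per server plus simple byte-rate arithmetic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical illustration only; not the ThrottledFetcher implementation.
class PerServerThrottleSketch {
    // One semaphore per server caps the number of simultaneous open connections.
    // The map is static so the limits outlive any single fetch.
    private static final Map<String, Semaphore> SLOTS = new ConcurrentHashMap<>();

    // Try to claim a connection slot for the server without blocking.
    // maxConnections is only applied the first time a server is seen.
    static boolean tryOpen(String server, int maxConnections) {
        return SLOTS.computeIfAbsent(server, s -> new Semaphore(maxConnections)).tryAcquire();
    }

    // Return a slot previously claimed with tryOpen.
    static void close(String server) {
        Semaphore slots = SLOTS.get(server);
        if (slots != null) {
            slots.release();
        }
    }

    // Bandwidth control: the earliest time, in milliseconds after the fetch began,
    // at which bytesSoFar bytes may have been read without exceeding the rate.
    // Assumes maxBytesPerSecond is greater than zero.
    static long earliestMillisFor(long bytesSoFar, long maxBytesPerSecond) {
        return (bytesSoFar * 1000L) / maxBytesPerSecond;
    }

    public static void main(String[] args) {
        System.out.println("first slot granted:  " + tryOpen("10.32.65.1", 1));
        System.out.println("second slot granted: " + tryOpen("10.32.65.1", 1));
        close("10.32.65.1");
        System.out.println("ms before 2048 bytes at 1024 B/s: " + earliestMillisFor(2048, 1024));
    }
}
```

A reader that has consumed bytesSoFar bytes would sleep until earliestMillisFor has elapsed, which keeps the average rate at or below the configured limit.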


Nested Class Summary
protected static class ThrottledFetcher.ConnectionBin
          Connection pool for a bin.
protected static class ThrottledFetcher.DataRecorder
          This class takes care of recording data and results for posterity
protected static class ThrottledFetcher.DataSession
          Helper class for ThrottledFetcher.DataRecorder
protected static class ThrottledFetcher.PoolException
          Pool exception class
protected static class ThrottledFetcher.SocketCreateThread
          Create a secure socket in a thread, so that we can "give up" after a while if the socket fails to connect.
protected static class ThrottledFetcher.ThrottleBin
          Throttles for a bin.
protected static class ThrottledFetcher.ThrottledConnection
          Throttled connections.
protected static class ThrottledFetcher.ThrottledInputstream
          This class throttles an input stream based on the specified byte rate parameters.
protected static class ThrottledFetcher.WaitException
          Wait exception class
protected static class ThrottledFetcher.WebSecureSocketFactory
          HTTPClient secure socket factory, which implements SecureProtocolSocketFactory
 
Field Summary
static java.lang.String _rcsid
           
protected static java.util.HashMap connectionBins
          This is the static pool of ConnectionBin objects, keyed by bin name.
protected static java.lang.String dataFileFolder
           
protected static ThrottledFetcher.DataRecorder dataRecorder
           
protected static java.lang.Integer poolLock
          This global lock protects the "distributed pool" resource, and ensures that a connection can be pulled out of all the right pools and end up in the hands of only one thread.
protected static int READ_CHUNK_LENGTH
          The read chunk length
protected static boolean recordEverything
          This flag determines whether we record everything to the disk, as a means of doing a web snapshot
protected static java.lang.String resultLogFile
           
protected static java.util.HashMap throttleBins
          This is the static pool of ThrottleBin objects, keyed by bin name.
protected static long TIME_15MIN
           
protected static long TIME_1DAY
           
protected static long TIME_2HRS
           
protected static long TIME_5MIN
           
protected static long TIME_6HRS
           
 
Constructor Summary
ThrottledFetcher()
          Constructor.
 
Method Summary
static void flushIdleConnections()
          Flush connections that have timed out from inactivity.
static IThrottledConnection getConnection(java.lang.String protocol, java.lang.String server, int port, PageCredentials authentication, org.apache.manifoldcf.core.interfaces.IKeystoreManager trustStore, ThrottleDescription throttleDescription, java.lang.String[] binNames, int connectionLimit)
          Obtain a connection to the specified protocol, server, and port.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values

recordEverything

protected static final boolean recordEverything
This flag determines whether we record everything to the disk, as a means of doing a web snapshot

See Also:
Constant Field Values

TIME_2HRS

protected static final long TIME_2HRS
See Also:
Constant Field Values

TIME_5MIN

protected static final long TIME_5MIN
See Also:
Constant Field Values

TIME_15MIN

protected static final long TIME_15MIN
See Also:
Constant Field Values

TIME_6HRS

protected static final long TIME_6HRS
See Also:
Constant Field Values

TIME_1DAY

protected static final long TIME_1DAY
See Also:
Constant Field Values

connectionBins

protected static java.util.HashMap connectionBins
This is the static pool of ConnectionBin objects, keyed by bin name.


throttleBins

protected static java.util.HashMap throttleBins
This is the static pool of ThrottleBin objects, keyed by bin name.


poolLock

protected static java.lang.Integer poolLock
This global lock protects the "distributed pool" resource, and ensures that a connection can be pulled out of all the right pools and end up in the hands of only one thread.
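
The role of a single global lock across several bins can be sketched as follows. This is a hedged illustration, not the actual pool code: MultiBinReservationSketch, reserve, and release are hypothetical names, the per-bin counters stand in for real connection pools, and a plain Object is used as the lock where the real field is a java.lang.Integer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the ThrottledFetcher pool implementation.
class MultiBinReservationSketch {
    // A single global lock guards every bin, so a reservation is all-or-nothing.
    private static final Object POOL_LOCK = new Object();
    // Number of connections currently checked out of each bin, keyed by bin name.
    private static final Map<String, Integer> IN_USE = new HashMap<>();

    // Reserve one slot in every named bin, or none at all.
    // Assumes binNames contains no duplicates.
    static boolean reserve(String[] binNames, int limitPerBin) {
        synchronized (POOL_LOCK) {
            // First pass: verify that every bin has room.
            for (String bin : binNames) {
                if (IN_USE.getOrDefault(bin, 0) >= limitPerBin) {
                    return false;
                }
            }
            // Second pass: commit. No other thread can interleave between the passes.
            for (String bin : binNames) {
                IN_USE.merge(bin, 1, Integer::sum);
            }
            return true;
        }
    }

    // Give back the slot held in every named bin.
    static void release(String[] binNames) {
        synchronized (POOL_LOCK) {
            for (String bin : binNames) {
                IN_USE.computeIfPresent(bin, (k, v) -> v > 1 ? v - 1 : null);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(reserve(new String[] {"bin-a", "bin-b"}, 1));
        System.out.println(reserve(new String[] {"bin-b", "bin-c"}, 1));
        release(new String[] {"bin-a", "bin-b"});
    }
}
```

Without the shared lock, two threads could each pass the check on different bins and then both commit, leaving one bin over its limit.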


READ_CHUNK_LENGTH

protected static final int READ_CHUNK_LENGTH
The read chunk length

See Also:
Constant Field Values

resultLogFile

protected static final java.lang.String resultLogFile
See Also:
Constant Field Values

dataFileFolder

protected static final java.lang.String dataFileFolder
See Also:
Constant Field Values

dataRecorder

protected static ThrottledFetcher.DataRecorder dataRecorder

Constructor Detail

ThrottledFetcher

public ThrottledFetcher()
Constructor.

Method Detail

getConnection

public static IThrottledConnection getConnection(java.lang.String protocol,
                                                 java.lang.String server,
                                                 int port,
                                                 PageCredentials authentication,
                                                 org.apache.manifoldcf.core.interfaces.IKeystoreManager trustStore,
                                                 ThrottleDescription throttleDescription,
                                                 java.lang.String[] binNames,
                                                 int connectionLimit)
                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Obtain a connection to the specified protocol, server, and port. The protocol is used to distinguish connections because the setup for some protocols (e.g. https) is extensive, and distinguishing connections this way should avoid repeating it.

Parameters:
protocol - is the protocol, e.g. "http"
server - is the server IP address, e.g. "10.32.65.1"
port - is the port to connect to, e.g. 80. Pass -1 if the default port for the protocol is desired.
authentication - is the page credentials object to use for the fetch. If null, no credentials are available.
trustStore - is the current trust store in effect for the fetch.
throttleDescription - is the description of all the throttling that should take place.
binNames - is the set of bins, in order, that should be used for throttling this connection. Note that the bin names for a given IP address and port MUST be the same for every connection! Whatever builds the bin names must enforce this, deriving them from the IP address and port.
connectionLimit - is the maximum number of connections permitted.
Returns:
an IThrottledConnection object that can be used to fetch from the port.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
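
Two of the parameter rules above, the -1 default-port convention and the requirement that bin names be identical for a given IP address and port, can be illustrated with a small sketch. Everything here is an assumption made for illustration: GetConnectionArgsSketch, resolvePort, and binNamesFor are hypothetical helpers, the bin-name scheme is invented, and none of it reflects how ManifoldCF actually derives these values.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helpers for preparing getConnection arguments; not part of ManifoldCF.
class GetConnectionArgsSketch {
    // Resolve the -1 "use the protocol default" convention to a concrete port.
    // Only the standard http (80) and https (443) defaults are handled here.
    static int resolvePort(String protocol, int port) {
        if (port != -1) {
            return port;
        }
        switch (protocol.toLowerCase(Locale.ROOT)) {
            case "http":
                return 80;
            case "https":
                return 443;
            default:
                throw new IllegalArgumentException("No default port known for " + protocol);
        }
    }

    // Derive bin names from the IP address and port alone, so every caller
    // gets the same ordered names for the same endpoint. The naming scheme is invented.
    static String[] binNamesFor(String ipAddress, int port) {
        return new String[] {ipAddress + ":" + port, ipAddress};
    }

    public static void main(String[] args) {
        int port = resolvePort("http", -1);
        System.out.println(port);
        System.out.println(Arrays.toString(binNamesFor("10.32.65.1", port)));
    }
}
```

Because binNamesFor depends only on its two arguments, two threads asking about the same endpoint cannot end up throttled by different bins.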

flushIdleConnections

public static void flushIdleConnections()
                                 throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Flush connections that have timed out from inactivity.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
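
Flushing connections by inactivity can be sketched as below. This is an assumption-laden illustration rather than the real implementation: IdleFlushSketch, checkIn, and flushIdle are hypothetical names, the idle threshold is passed in explicitly, and closing a connection is modeled by setting a flag.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not the ThrottledFetcher implementation.
class IdleFlushSketch {
    // Stand-in for a pooled connection together with the time it was last used.
    static final class PooledConnection {
        final String name;
        final long lastUsedMillis;
        boolean closed = false;

        PooledConnection(String name, long lastUsedMillis) {
            this.name = name;
            this.lastUsedMillis = lastUsedMillis;
        }
    }

    private final List<PooledConnection> idle = new ArrayList<>();

    // Return a connection to the pool, recording when it was last used.
    synchronized void checkIn(String name, long nowMillis) {
        idle.add(new PooledConnection(name, nowMillis));
    }

    // Close and remove every connection idle for longer than maxIdleMillis.
    // Returns the number of connections flushed.
    synchronized int flushIdle(long nowMillis, long maxIdleMillis) {
        int flushed = 0;
        for (Iterator<PooledConnection> it = idle.iterator(); it.hasNext();) {
            PooledConnection connection = it.next();
            if (nowMillis - connection.lastUsedMillis > maxIdleMillis) {
                connection.closed = true;
                it.remove();
                flushed++;
            }
        }
        return flushed;
    }

    synchronized int size() {
        return idle.size();
    }

    public static void main(String[] args) {
        IdleFlushSketch pool = new IdleFlushSketch();
        pool.checkIn("old", 0L);
        pool.checkIn("recent", 50_000L);
        System.out.println("flushed: " + pool.flushIdle(70_000L, 60_000L));
    }
}
```

A caller would typically invoke such a method periodically from a housekeeping thread, so that idle connections do not hold server resources indefinitely.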