java.lang.Object
  org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher.ThrottleBin
protected static class ThrottledFetcher.ThrottleBin
Throttles for a bin. An instance of this class keeps track of the information needed to throttle the bandwidth used to access a URL belonging to a specific bin.

In order to calculate the effective "burst" fetches per second and bytes per second, we need to have some idea what the window is. For example, if the window length is too long, a long hiatus from fetching could cause overuse of the server when fetching resumes.

One solution to this problem would be to keep a list of the individual fetches as records. Then, we could "expire" a fetch by discarding the old record. However, this consumes a lot of memory for all but the smallest intervals.

Another, better, solution is to hook into the start and end of individual fetches. These will, presumably, occur at the fastest possible rate without long pauses spent doing something else. The only complication is that fetches may well overlap, so we need to "reference count" the fetches to know when to reset the counters.

For "fetches per second", we can simply make sure we "schedule" the next fetch at an appropriate time, rather than keep records around. The overall rate may therefore be somewhat less than the specified rate, but that's perfectly acceptable.

Some notes on the algorithms used to limit server bandwidth impact
==================================================================

In a single connection case, the algorithm we'd want to use works like this. On the first chunk of a series, the total length of time and the number of bytes are recorded. Then, prior to each subsequent chunk, a calculation is done which attempts to hit the bandwidth target by the end of the chunk read, using the rate of the first chunk access as a way of estimating how long it will take to fetch those next n bytes.

For a multi-connection case, which this is, it is harder to come up with a good maximum bandwidth estimate, and harder still to "hit the target", because simultaneous fetches will intrude.

The strategy is therefore:

1) The first chunk of any series should proceed without interference from other connections to the same server. The goal here is to get a decent quality estimate without any possibility of overwhelming the server.

2) The bandwidth of the first chunk is treated as the "maximum bandwidth per connection". That is, if other connections are going on, we can presume that each connection will use at most the bandwidth that the first fetch took. Thus, by generating end-time estimates based on this number, we are actually being conservative and using less server bandwidth.

3) For chunks that have started but not finished, we keep track of their size and estimated elapsed time in order to schedule when new chunks from other connections can start.
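The arithmetic behind this strategy can be sketched roughly as follows. This is a minimal, single-threaded model and not the ManifoldCF implementation: the class name BinRateSketch and the methods recordFirstChunk, delayBeforeRead, and recordRead are hypothetical, timestamps are passed in by the caller, and the gating of the first chunk against other connections (point 1) is left out. It also assumes that the counters are reset when the reference count drops to zero.

```java
/** Hypothetical, simplified model of the per-bin bandwidth strategy. */
final class BinRateSketch {
    private int refCount = 0;               // fetches currently active
    private boolean estimateValid = false;  // true once a first chunk has been measured
    private double rateEstimate = 0.0;      // inverse rate of the first chunk, in ms/byte
    private long seriesStartTime = 0L;      // start of the current series, in ms
    private long totalBytesRead = 0L;       // bytes counted in the current series

    /** Note the start of a fetch (assumption: just a reference count). */
    void beginFetch() {
        refCount++;
    }

    /** Note the end of a fetch; returns true and resets the series when none remain. */
    boolean endFetch() {
        refCount--;
        if (refCount == 0) {
            estimateValid = false;
            rateEstimate = 0.0;
            totalBytesRead = 0L;
            return true;
        }
        return false;
    }

    /** Record the timing of the first chunk of a series. */
    void recordFirstChunk(long startMillis, long endMillis, int bytes) {
        if (bytes <= 0) {
            return; // nothing was read, so no estimate can be formed
        }
        seriesStartTime = startMillis;
        totalBytesRead = bytes;
        rateEstimate = (double) (endMillis - startMillis) / bytes;
        estimateValid = true;
    }

    /**
     * Milliseconds to wait before reading byteCount more bytes so that the
     * estimated end of the read does not beat the target of minMsPerByte.
     */
    long delayBeforeRead(long nowMillis, int byteCount, double minMsPerByte) {
        if (!estimateValid) {
            return 0L; // the first chunk proceeds without delay
        }
        // When the read would finish if started now, at the first-chunk rate.
        double estimatedEnd = nowMillis + rateEstimate * byteCount;
        // Earliest time by which this many bytes may have been read in total.
        double earliestAllowedEnd =
            seriesStartTime + minMsPerByte * (totalBytesRead + byteCount);
        double wait = earliestAllowedEnd - estimatedEnd;
        return wait > 0.0 ? (long) Math.ceil(wait) : 0L;
    }

    /** Add the bytes actually read to the series total. */
    void recordRead(int actualBytes) {
        if (actualBytes > 0) {
            totalBytesRead += actualBytes;
        }
    }
}
```

In this model, a first chunk of 1000 bytes read in 250 ms gives an estimate of 0.25 ms/byte. With a limit of 1.0 ms/byte, a second 1000-byte read requested at t = 250 ms would be delayed by 1500 ms, so that the 2000 bytes are estimated to finish no earlier than t = 2000 ms.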
| Field Summary | | |
|---|---|---|
| protected java.lang.String | binName | The name of the bin to which this throttle belongs. |
| protected boolean | estimateInProgress | Flag indicating whether rate estimation is currently in progress. |
| protected boolean | estimateValid | Flag indicating whether a rate estimate is needed. |
| protected java.lang.Integer | firstChunkLock | This object is used to gate access while the first chunk is being read. |
| protected double | rateEstimate | The inverse rate estimate of the first fetch, in ms/byte. |
| protected int | refCount | The reference count for this bin (the number of active references). |
| protected long | seriesStartTime | The start time of this series. |
| protected long | totalBytesRead | Total actual bytes read in this series; this includes fetches in progress. |
| Constructor Summary | |
|---|---|
| ThrottledFetcher.ThrottleBin(java.lang.String binName) | Constructor. |
| Method Summary | | |
|---|---|---|
| void | beginFetch() | Note the start of a fetch operation for a bin. |
| void | beginRead(int byteCount, double minimumMillisecondsPerBytePerServer) | Note the start of an individual byte read of a specified size. |
| boolean | endFetch() | Note the end of a fetch operation. |
| void | endRead(int originalCount, int actualCount) | Note the end of an individual read from the server. |
| java.lang.String | getBinName() | Get the bin name. |
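The method names suggest a bracketing call order: one beginFetch()/endFetch() pair around a whole fetch, and one beginRead()/endRead() pair around each chunk. The sketch below illustrates that order only. ThrottleBin itself is a protected nested class, so the code uses a hypothetical BinThrottle interface that copies the four signatures, a hypothetical CountingThrottle stub that never delays, and a hypothetical ThrottledCopy helper. The meaning given to the boolean returned by endFetch() (true when no fetches remain active) is an assumption, not something this page states.

```java
/** Hypothetical stand-in with the same method signatures as ThrottleBin. */
interface BinThrottle {
    void beginFetch() throws InterruptedException;
    void beginRead(int byteCount, double minimumMillisecondsPerBytePerServer)
        throws InterruptedException;
    void endRead(int originalCount, int actualCount);
    boolean endFetch();
}

/** Stub that only counts; a real throttle could block in beginFetch/beginRead. */
final class CountingThrottle implements BinThrottle {
    private int refCount = 0;
    private long totalBytesRead = 0L;

    public synchronized void beginFetch() {
        refCount++;
    }

    public synchronized void beginRead(int byteCount, double minMsPerByte) {
        // No delay in this stub.
    }

    public synchronized void endRead(int originalCount, int actualCount) {
        if (actualCount > 0) {
            totalBytesRead += actualCount;
        }
    }

    public synchronized boolean endFetch() {
        refCount--;
        return refCount == 0; // assumed meaning of the return value
    }

    synchronized int getRefCount() {
        return refCount;
    }

    synchronized long getTotalBytesRead() {
        return totalBytesRead;
    }
}

/** Hypothetical helper: drains a stream while notifying the throttle. */
final class ThrottledCopy {
    static long drain(java.io.InputStream in, BinThrottle throttle, double minMsPerByte)
            throws java.io.IOException, InterruptedException {
        byte[] buffer = new byte[4096];
        long total = 0L;
        throttle.beginFetch();
        try {
            while (true) {
                throttle.beginRead(buffer.length, minMsPerByte);
                int n = -1;
                try {
                    n = in.read(buffer, 0, buffer.length);
                } finally {
                    // Always pair endRead with beginRead, even if read() throws.
                    throttle.endRead(buffer.length, n);
                }
                if (n < 0) {
                    break; // end of stream
                }
                total += n;
            }
        } finally {
            // Always pair endFetch with beginFetch.
            throttle.endFetch();
        }
        return total;
    }
}
```

The try/finally blocks are the point of the sketch: if a read fails, the throttle still sees the matching endRead and endFetch calls, so its reference count cannot drift.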
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
protected java.lang.String binName
protected int refCount
protected double rateEstimate
protected boolean estimateValid
protected boolean estimateInProgress
protected long seriesStartTime
protected long totalBytesRead
protected java.lang.Integer firstChunkLock
| Constructor Detail |
|---|
public ThrottledFetcher.ThrottleBin(java.lang.String binName)
| Method Detail |
|---|
public java.lang.String getBinName()
public void beginFetch()
                throws java.lang.InterruptedException
Throws: java.lang.InterruptedException
public void beginRead(int byteCount,
                      double minimumMillisecondsPerBytePerServer)
               throws java.lang.InterruptedException
Throws: java.lang.InterruptedException
public void endRead(int originalCount,
                    int actualCount)
public boolean endFetch()