public abstract class AbstractFetchSchedule extends org.apache.hadoop.conf.Configured implements FetchSchedule
| Modifier and Type | Field and Description |
|---|---|
| protected int | defaultInterval |
| protected int | maxInterval |

Fields inherited from interface FetchSchedule: SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN

| Constructor and Description |
|---|
| AbstractFetchSchedule() |
| AbstractFetchSchedule(org.apache.hadoop.conf.Configuration conf) |
| Modifier and Type | Method and Description |
|---|---|
| long | calculateLastFetchTime(CrawlDatum datum): Returns the last fetch time of the CrawlDatum. |
| CrawlDatum | forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap): Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching. |
| CrawlDatum | initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum): Initializes fetch schedule related data. |
| void | setConf(org.apache.hadoop.conf.Configuration conf) |
| CrawlDatum | setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state): Sets the fetchInterval and fetchTime on a successfully fetched page. |
| CrawlDatum | setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime): Specifies how to schedule refetching of pages marked as GONE. |
| CrawlDatum | setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime): Adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
| boolean | shouldFetch(org.apache.hadoop.io.Text url, CrawlDatum datum, long curTime): Indicates whether the page is suitable for selection in the current fetchlist. |
public AbstractFetchSchedule()
public AbstractFetchSchedule(org.apache.hadoop.conf.Configuration conf)
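Both constructors ultimately rely on setConf(Configuration) to populate the protected defaultInterval and maxInterval fields. A minimal sketch of how that population might look is below; the property names db.fetch.interval.default and db.fetch.interval.max and the fallback values are assumptions, and a plain Map stands in for Hadoop's Configuration.

```java
import java.util.Map;

// Sketch of configuration-driven field initialization, with a Map standing
// in for org.apache.hadoop.conf.Configuration. Property names and default
// values are assumptions for illustration.
public class FetchScheduleConfigSketch {
    protected int defaultInterval;
    protected int maxInterval;

    public void setConf(Map<String, String> conf) {
        // Interval values are expressed in seconds.
        defaultInterval = Integer.parseInt(
            conf.getOrDefault("db.fetch.interval.default", "2592000")); // ~30 days
        maxInterval = Integer.parseInt(
            conf.getOrDefault("db.fetch.interval.max", "7776000"));     // ~90 days
    }
}
```

Subclasses overriding setConf would call super.setConf(conf) first so these base fields stay initialized.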
public void setConf(org.apache.hadoop.conf.Configuration conf)

Specified by: setConf in interface org.apache.hadoop.conf.Configurable
Overrides: setConf in class org.apache.hadoop.conf.Configured

public CrawlDatum initializeSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum)

Initializes fetch schedule related data, i.e. fetchTime and fetchInterval. The default implementation sets the fetchTime to now, using the default fetchInterval.

Specified by: initializeSchedule in interface FetchSchedule
Parameters:
url - URL of the page.
datum - datum instance to be initialized (modified in place).

public CrawlDatum setFetchSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter; extending classes should call super.setFetchSchedule() to preserve this behavior.

Specified by: setFetchSchedule in interface FetchSchedule
Parameters:
url - URL of the page.
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the latest time when the page was recently re-fetched. Most FetchSchedule implementations should update the value in CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in CrawlDatum to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.

public CrawlDatum setPageGoneSchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE; the rescheduled fetch interval is bounded by maxInterval.

Specified by: setPageGoneSchedule in interface FetchSchedule
Parameters:
url - URL of the page.
datum - datum instance to be adjusted.

public CrawlDatum setPageRetrySchedule(org.apache.hadoop.io.Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.

Specified by: setPageRetrySchedule in interface FetchSchedule
Parameters:
url - URL of the page.
datum - page information.
prevFetchTime - previous fetch time.
prevModifiedTime - previous modified time.
fetchTime - current fetch time.

public long calculateLastFetchTime(CrawlDatum datum)
Returns the last fetch time of the CrawlDatum.

Specified by: calculateLastFetchTime in interface FetchSchedule

public boolean shouldFetch(org.apache.hadoop.io.Text url, CrawlDatum datum, long curTime)

This method provides information whether the page is suitable for selection in the current fetchlist. It checks the fetchTime: if it is higher than curTime, it returns false; otherwise true. It also checks that fetchTime is not too remote (more than maxInterval), in which case it lowers the interval and returns true.

Specified by: shouldFetch in interface FetchSchedule
Parameters:
url - URL of the page.
datum - datum instance.
curTime - reference time (usually set to the time when the fetchlist generation process was started).

public CrawlDatum forceRefetch(org.apache.hadoop.io.Text url, CrawlDatum datum, boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.

Specified by: forceRefetch in interface FetchSchedule
Parameters:
url - URL of the page.
datum - datum instance.
asap - if true, force refetch as soon as possible, i.e. set the fetchTime to now. If false, force refetch whenever the next fetch time is set.

Copyright © 2014 The Apache Software Foundation
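To make the scheduling lifecycle above concrete, here is a self-contained sketch of a datum being rescheduled after a successful fetch and then force-refetched. The Datum POJO, its field handling, and the unit conventions are assumptions standing in for Nutch's CrawlDatum, which is considerably more involved.

```java
// Self-contained sketch of the setFetchSchedule / forceRefetch behavior
// described above. The Datum POJO is an assumption standing in for Nutch's
// CrawlDatum; times are epoch milliseconds, intervals are seconds.
public class RefetchSketch {

    static class Datum {
        long fetchTime;        // next scheduled fetch (epoch millis)
        long modifiedTime;
        int fetchInterval;     // seconds
        int retriesSinceFetch;
        byte[] signature;
    }

    /** After a successful fetch: advance fetchTime and reset the retry counter. */
    static Datum setFetchSchedule(Datum d, long fetchTime) {
        d.fetchTime = fetchTime + d.fetchInterval * 1000L;
        d.retriesSinceFetch = 0; // mirrors the documented retry-counter reset
        return d;
    }

    /** Reset scheduling state so the page is refetched. */
    static Datum forceRefetch(Datum d, long now, int defaultInterval, boolean asap) {
        d.fetchInterval = defaultInterval;
        d.modifiedTime = 0;
        d.retriesSinceFetch = 0;
        d.signature = null;      // clearing the signature forces a "changed" view
        if (asap) {
            d.fetchTime = now;   // eligible for fetching immediately
        }
        return d;
    }
}
```

With asap = false, the sketch leaves fetchTime alone, so the page is refetched whenever its next scheduled fetch time arrives, matching the contract described for forceRefetch.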