public class URLMetaScoringFilter extends org.apache.hadoop.conf.Configured implements ScoringFilter
See Also: URLMetaIndexingFilter

Fields inherited from interface ScoringFilter: X_POINT_ID

| Constructor and Description |
|---|
| URLMetaScoringFilter() |

| Modifier and Type | Method and Description |
|---|---|
| CrawlDatum | distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount): Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object. |
| float | generatorSortValue(org.apache.hadoop.io.Text url, CrawlDatum datum, float initSort): Boilerplate |
| org.apache.hadoop.conf.Configuration | getConf(): Boilerplate |
| float | indexerScore(org.apache.hadoop.io.Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore): Boilerplate |
| void | initialScore(org.apache.hadoop.io.Text url, CrawlDatum datum): Boilerplate |
| void | injectedScore(org.apache.hadoop.io.Text url, CrawlDatum datum): Boilerplate |
| void | passScoreAfterParsing(org.apache.hadoop.io.Text url, Content content, Parse parse): Takes the metadata, which was lumped inside the content, and replicates it within your parse data. |
| void | passScoreBeforeParsing(org.apache.hadoop.io.Text url, CrawlDatum datum, Content content): Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content. |
| void | setConf(org.apache.hadoop.conf.Configuration conf): Handles conf assignment and reads the value of the "urlmeta.tags" property. |
| void | updateDbScore(org.apache.hadoop.io.Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked): Boilerplate |
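
The summary above describes a three-step hand-off of the tags named in "urlmeta.tags": from the datum into the content, from the content into the parse data, and from the parse data to the outlink targets. The sketch below models that flow with plain maps so it can run without Nutch on the classpath. It is an illustration only, not the filter's source: the class UrlMetaPropagationSketch and the helpers parseTags, copyTags, and propagate are hypothetical names, the maps stand in for the metadata of CrawlDatum, Content, and ParseData, and the comma-separated format of the property value is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Simplified model of the metadata hand-off described in the method summary.
 * Plain maps stand in for CrawlDatum, Content, and ParseData metadata;
 * none of the names below belong to the Nutch API.
 */
class UrlMetaPropagationSketch {

    /** Splits an (assumed) comma-separated "urlmeta.tags" value into trimmed tag names. */
    static List<String> parseTags(String propertyValue) {
        List<String> tags = new ArrayList<>();
        if (propertyValue == null) {
            return tags;
        }
        for (String tag : propertyValue.split(",")) {
            String trimmed = tag.trim();
            if (!trimmed.isEmpty()) {
                tags.add(trimmed);
            }
        }
        return tags;
    }

    /** Copies only the configured tags that are present in source into target. */
    static void copyTags(List<String> tags, Map<String, String> source, Map<String, String> target) {
        for (String tag : tags) {
            String value = source.get(tag);
            if (value != null) {
                target.put(tag, value);
            }
        }
    }

    /** Runs the three hand-offs for one page and returns the metadata its outlink receives. */
    static Map<String, String> propagate(String tagsProperty, Map<String, String> datumMeta) {
        List<String> tags = parseTags(tagsProperty);
        Map<String, String> contentMeta = new LinkedHashMap<>();
        Map<String, String> parseMeta = new LinkedHashMap<>();
        Map<String, String> outlinkMeta = new LinkedHashMap<>();
        copyTags(tags, datumMeta, contentMeta);  // models passScoreBeforeParsing
        copyTags(tags, contentMeta, parseMeta);  // models passScoreAfterParsing
        copyTags(tags, parseMeta, outlinkMeta);  // models distributeScoreToOutlinks
        return outlinkMeta;
    }

    public static void main(String[] args) {
        Map<String, String> datumMeta = new LinkedHashMap<>();
        datumMeta.put("category", "news");
        datumMeta.put("unlisted", "ignored");
        // Only "category" is both configured and present, so only it reaches the outlink.
        System.out.println(propagate("category, source", datumMeta)); // prints {category=news}
    }
}
```

In this model a tag reaches the outlink only if it is both listed in the property and present on the source datum, which is why "unlisted" and "source" do not appear in the result.
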
public CrawlDatum distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl,
ParseData parseData,
Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
throws ScoringFilterException
Specified by:
distributeScoreToOutlinks in interface ScoringFilter
Parameters:
fromUrl - url of the source page
parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
allCount - number of all collected outlinks from the source page
Returns:
a CrawlDatum instance with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
Throws:
ScoringFilterException
See Also:
ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text, org.apache.nutch.parse.ParseData, java.util.Collection<java.util.Map.Entry<org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum>>, org.apache.nutch.crawl.CrawlDatum, int)

public void passScoreBeforeParsing(org.apache.hadoop.io.Text url,
CrawlDatum datum,
Content content)
Specified by:
passScoreBeforeParsing in interface ScoringFilter
Parameters:
url - url of the page
datum - source datum. NOTE: modifications to this value are not persisted.
content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
See Also:
ScoringFilter.passScoreBeforeParsing(org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.protocol.Content)
passScoreAfterParsing(org.apache.hadoop.io.Text, org.apache.nutch.protocol.Content, org.apache.nutch.parse.Parse)

public void passScoreAfterParsing(org.apache.hadoop.io.Text url,
Content content,
Parse parse)
Specified by:
passScoreAfterParsing in interface ScoringFilter
Parameters:
url - page url
content - original content. NOTE: modifications to this value are not persisted.
parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
See Also:
passScoreBeforeParsing(org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.protocol.Content)
ScoringFilter.passScoreAfterParsing(org.apache.hadoop.io.Text, org.apache.nutch.protocol.Content, org.apache.nutch.parse.Parse)

public float generatorSortValue(org.apache.hadoop.io.Text url,
CrawlDatum datum,
float initSort)
throws ScoringFilterException
Specified by:
generatorSortValue in interface ScoringFilter
Parameters:
url - url of the page
datum - page's datum, should not be modified
initSort - initial sort value, or a value from previous filters in chain
Throws:
ScoringFilterException

public float indexerScore(org.apache.hadoop.io.Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
throws ScoringFilterException
Specified by:
indexerScore in interface ScoringFilter
Parameters:
url - url of the page
doc - Lucene document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
dbDatum - current page from CrawlDb. NOTE: changes made to this instance are not persisted.
fetchDatum - datum from FetcherOutput (containing among others the fetching status)
parse - parsing result. NOTE: changes made to this instance are not persisted.
inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
initScore - initial boost value for the Lucene document.
Throws:
ScoringFilterException

public void initialScore(org.apache.hadoop.io.Text url,
CrawlDatum datum)
throws ScoringFilterException
Specified by:
initialScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException

public void injectedScore(org.apache.hadoop.io.Text url,
CrawlDatum datum)
throws ScoringFilterException
Specified by:
injectedScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException

public void updateDbScore(org.apache.hadoop.io.Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
throws ScoringFilterException
Specified by:
updateDbScore in interface ScoringFilter
Parameters:
url - url of the page
old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values, because the datum parameter may contain values that are no longer valid if other updates occurred between generation and this update.
datum - the new datum, with the original score saved at the time when the fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
inlinked - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch.
Throws:
ScoringFilterException

public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by:
setConf in interface org.apache.hadoop.conf.Configurable
Overrides:
setConf in class org.apache.hadoop.conf.Configured

public org.apache.hadoop.conf.Configuration getConf()
Specified by:
getConf in interface org.apache.hadoop.conf.Configurable
Overrides:
getConf in class org.apache.hadoop.conf.Configured

Copyright © 2014 The Apache Software Foundation