public abstract class RobotRulesParser extends Object implements org.apache.hadoop.conf.Configurable
This class uses crawler-commons to handle the parsing of robots.txt files.
It emits SimpleRobotRules objects, which represent the download permissions
as described in SimpleRobotRulesParser.

| Modifier and Type | Field and Description |
|---|---|
| protected String | agentNames |
| protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> | CACHE |
| static crawlercommons.robots.BaseRobotRules | EMPTY_RULES: A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed. |
| static crawlercommons.robots.BaseRobotRules | FORBID_ALL_RULES: A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed. |
| static org.slf4j.Logger | LOG |

| Constructor and Description |
|---|
| RobotRulesParser() |
| RobotRulesParser(org.apache.hadoop.conf.Configuration conf) |

| Modifier and Type | Method and Description |
|---|---|
| org.apache.hadoop.conf.Configuration | getConf(): Get the Configuration object |
| crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url) |
| abstract crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol protocol, URL url) |
| static void | main(String[] argv): Command-line main for testing |
| crawlercommons.robots.BaseRobotRules | parseRules(String url, byte[] content, String contentType, String robotName): Parses the robots content using the SimpleRobotRulesParser from crawler commons |
| void | setConf(org.apache.hadoop.conf.Configuration conf): Set the Configuration object |

public static final org.slf4j.Logger LOG
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES
A BaseRobotRules object appropriate for use
when the robots.txt file is empty or missing;
all requests are allowed.
public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES
A BaseRobotRules object appropriate for use when the
robots.txt file is not fetched due to a 403/Forbidden
response; all requests are disallowed.
protected String agentNames
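The two fallback constants described above can be illustrated with a small, self-contained sketch. It does not use the Nutch or crawler-commons classes: the nested `Rules` interface is a hypothetical stand-in for `BaseRobotRules`, and `fallbackFor` is an invented helper. Only the 403 and missing-file cases come from the field descriptions; mapping "missing" to HTTP 404 and leaving every other status to the caller are assumptions.

```java
import java.util.Optional;

final class RobotsFallbackSketch {
    /** Hypothetical stand-in for crawlercommons.robots.BaseRobotRules. */
    interface Rules {
        boolean isAllowed(String url);
    }

    /** Stand-in for EMPTY_RULES: every request is allowed. */
    static final Rules EMPTY_RULES = url -> true;

    /** Stand-in for FORBID_ALL_RULES: every request is disallowed. */
    static final Rules FORBID_ALL_RULES = url -> false;

    /**
     * Hypothetical helper: maps the HTTP status of the robots.txt fetch to a
     * fallback rule set. Returns an empty Optional when neither fallback
     * applies and the caller has to decide (for example, parse the body).
     */
    static Optional<Rules> fallbackFor(int httpStatus) {
        switch (httpStatus) {
            case 403:
                return Optional.of(FORBID_ALL_RULES); // fetch was forbidden
            case 404:
                return Optional.of(EMPTY_RULES);      // assumption: 404 means "missing"
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/private/page.html";
        System.out.println(fallbackFor(403).get().isAllowed(page)); // false
        System.out.println(fallbackFor(404).get().isAllowed(page)); // true
        System.out.println(fallbackFor(200).isPresent());           // false
    }
}
```

Returning an `Optional` keeps the sketch from claiming anything about statuses the field descriptions do not cover.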
public RobotRulesParser()
public RobotRulesParser(org.apache.hadoop.conf.Configuration conf)
public void setConf(org.apache.hadoop.conf.Configuration conf)
Set the Configuration object
Specified by: setConf in interface org.apache.hadoop.conf.Configurable
public org.apache.hadoop.conf.Configuration getConf()
Get the Configuration object
Specified by: getConf in interface org.apache.hadoop.conf.Configurable
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)
Parses the robots content using the SimpleRobotRulesParser from crawler commons
Parameters:
url - A string containing the URL
content - Contents of the robots file in a byte array
contentType - The
robotName - A string containing value of
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, org.apache.hadoop.io.Text url)
public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url)
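How the two getRobotRulesSet overloads and the CACHE field might fit together can be sketched with JDK classes only. Everything here is a stand-in rather than the Nutch implementation: `Rules` replaces `BaseRobotRules`, a `String` replaces the Hadoop `Text` argument, the `Protocol` parameter is omitted, and `CountingLookup` is an invented subclass. The cache key (protocol plus host) and the fallback to allow-all rules for an unparseable URL are assumptions, not documented behaviour.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Hashtable;
import java.util.Locale;

abstract class RulesLookupSketch {
    /** Hypothetical stand-in for crawlercommons.robots.BaseRobotRules. */
    interface Rules {
        boolean isAllowed(String url);
    }

    /** Stand-in for EMPTY_RULES: every request is allowed. */
    static final Rules EMPTY_RULES = url -> true;

    /** Mirrors the documented CACHE field: rule sets keyed by a string. */
    protected static final Hashtable<String, Rules> CACHE = new Hashtable<>();

    /** Assumed cache key: protocol and host, lower-cased. */
    static String cacheKey(URL url) {
        return (url.getProtocol() + ":" + url.getHost()).toLowerCase(Locale.ROOT);
    }

    /** Stands in for the Text overload: parse the string, then delegate. */
    Rules getRobotRulesSet(String url) {
        try {
            return getRobotRulesSet(new URI(url).toURL());
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            return EMPTY_RULES; // assumption: an unparseable URL is not blocked
        }
    }

    /** Stands in for the abstract overload that subclasses implement. */
    abstract Rules getRobotRulesSet(URL url);

    /** Invented demo subclass: counts how often a rule set is built. */
    static final class CountingLookup extends RulesLookupSketch {
        int fetches = 0;

        @Override
        Rules getRobotRulesSet(URL url) {
            return CACHE.computeIfAbsent(cacheKey(url), key -> {
                fetches++; // a real subclass would fetch and parse robots.txt here
                return EMPTY_RULES;
            });
        }
    }

    public static void main(String[] args) {
        CountingLookup lookup = new CountingLookup();
        lookup.getRobotRulesSet("http://example.com/a.html");
        lookup.getRobotRulesSet("http://EXAMPLE.com/b.html"); // same key, served from CACHE
        System.out.println(lookup.fetches);                                      // 1
        System.out.println(lookup.getRobotRulesSet("not a url") == EMPTY_RULES); // true
    }
}
```

Keeping the string-based overload concrete and the URL-based one abstract means a protocol-specific subclass only has to supply the fetch-and-parse step.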
public static void main(String[] argv)
Copyright © 2014 The Apache Software Foundation