org.apache.manifoldcf.crawler.connectors.rss
Class Robots

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.rss.Robots

public class Robots
extends java.lang.Object

This class is a cache of robots data for specific sites. Robots files are loaded and cached according to the standard robots rules: results are cached for up to 24 hours, and the format and parsing rules are consistent with http://www.robotstxt.org/wc/robots.html. The Apache HttpClient is used to fetch the robots files when necessary. An instance of this class should be constructed statically so that the caching works to maximum advantage.
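The caching idea described above can be sketched in a simplified standalone form. This is not the real Robots class (which stores richer Host objects and coordinates fetching and throttling); it only illustrates the protocol/host/port keying and the 24-hour expiration. All names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a robots.txt cache keyed by protocol/host/port,
// with the 24-hour expiration the class description mentions.
public class RobotsCacheSketch {
    static final long TTL_MS = 24L * 60L * 60L * 1000L;  // cache for up to 24 hours

    static class Entry {
        final String robotsBody;
        final long fetchedAt;
        Entry(String robotsBody, long fetchedAt) {
            this.robotsBody = robotsBody;
            this.fetchedAt = fetchedAt;
        }
    }

    final Map<String, Entry> cache = new HashMap<>();

    // Key combines protocol, host, and port, mirroring the cache field's doc.
    static String cacheKey(String protocol, String host, int port) {
        return protocol + ":" + host + ":" + port;
    }

    // Returns the cached robots body, or null if absent or older than 24 hours.
    String lookup(String protocol, String host, int port, long now) {
        Entry e = cache.get(cacheKey(protocol, host, port));
        if (e == null || now - e.fetchedAt > TTL_MS)
            return null;
        return e.robotsBody;
    }

    void store(String protocol, String host, int port, String body, long now) {
        cache.put(cacheKey(protocol, host, port), new Entry(body, now));
    }
}
```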


Nested Class Summary
protected  class Robots.Host
          This class maintains status for a given host.
protected static class Robots.Record
          This class represents a record in a robots.txt file.
 
Field Summary
static java.lang.String _rcsid
           
protected  java.util.Map cache
          The cache map, keyed by protocol/host/port, with a Host object as the value.
protected  ThrottledFetcher fetcher
          Fetcher to use to retrieve robots data
protected  int refCount
          Reference count
protected static java.lang.String ROBOT_CONNECTION_TYPE
          Robots connection type value
protected static java.lang.String ROBOT_FILE_NAME
          Robot file name value
protected static int ROBOT_TIMEOUT_MILLISECONDS
          Robots fetch timeout value
 
Constructor Summary
Robots(ThrottledFetcher fetcher)
          Constructor.
 
Method Summary
protected static boolean doesPathMatch(java.lang.String path, int pathIndex, java.lang.String spec, int specIndex)
          Recursive method for matching specification to path.
protected static boolean doesPathMatch(java.lang.String path, java.lang.String spec)
          Check whether a path matches a specification
 boolean isFetchAllowed(java.lang.String protocol, int port, java.lang.String hostName, java.lang.String pathString, java.lang.String userAgent, java.lang.String from, double minimumMillisecondsPerBytePerServer, int maxOpenConnectionsPerServer, long minimumMillisecondsPerFetchPerServer, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities, int connectionLimit)
          Decide whether a specific robot can crawl a specific URL.
protected static java.lang.String makeReadable(java.lang.String inputString)
          Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
 void noteConnectionEstablished()
          Note that a connection has been established.
 void noteConnectionReleased()
          Note that a connection has been released, and free resources if no reason to retain them.
 void poll()
          Clean idle entries out of the cache
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values

ROBOT_TIMEOUT_MILLISECONDS

protected static final int ROBOT_TIMEOUT_MILLISECONDS
Robots fetch timeout value

See Also:
Constant Field Values

ROBOT_CONNECTION_TYPE

protected static final java.lang.String ROBOT_CONNECTION_TYPE
Robots connection type value

See Also:
Constant Field Values

ROBOT_FILE_NAME

protected static final java.lang.String ROBOT_FILE_NAME
Robot file name value

See Also:
Constant Field Values

fetcher

protected ThrottledFetcher fetcher
Fetcher to use to retrieve robots data


refCount

protected int refCount
Reference count


cache

protected java.util.Map cache
The cache map, keyed by protocol/host/port, with a Host object as the value.

Constructor Detail

Robots

public Robots(ThrottledFetcher fetcher)
Constructor.

Method Detail

noteConnectionEstablished

public void noteConnectionEstablished()
Note that a connection has been established.


noteConnectionReleased

public void noteConnectionReleased()
Note that a connection has been released, and free resources if no reason to retain them.


poll

public void poll()
Clean idle entries out of the cache


isFetchAllowed

public boolean isFetchAllowed(java.lang.String protocol,
                              int port,
                              java.lang.String hostName,
                              java.lang.String pathString,
                              java.lang.String userAgent,
                              java.lang.String from,
                              double minimumMillisecondsPerBytePerServer,
                              int maxOpenConnectionsPerServer,
                              long minimumMillisecondsPerFetchPerServer,
                              java.lang.String proxyHost,
                              int proxyPort,
                              java.lang.String proxyAuthDomain,
                              java.lang.String proxyAuthUsername,
                              java.lang.String proxyAuthPassword,
                              org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
                              int connectionLimit)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                              org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Decide whether a specific robot can crawl a specific URL. A ServiceInterruption exception is thrown if the fetch itself fails in a transient way. A permanent failure (such as an invalid URL) will throw a ManifoldCFException.

Parameters:
protocol - is the name of the protocol (e.g. "http")
port - is the port number (-1 being the default for the protocol)
hostName - is the fully-qualified domain name of the host
pathString - is the path (non-query) part of the URL
userAgent - is the user-agent string used by the robot
from - is the e-mail address
Returns:
true if fetch is allowed, false otherwise.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
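To illustrate the kind of decision isFetchAllowed makes once a robots.txt body is in hand, here is a standalone sketch: find the records matching the user-agent (or "*") and test the path against their Disallow lines with plain prefix matching. This is a hypothetical simplification, not the real implementation; it merges all matching groups rather than choosing the most specific one, and it omits the caching, fetching, and throttling the real method performs.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a robots.txt allow/disallow decision: collect Disallow lines
// from every record whose User-agent matches, then test the path as a
// prefix. Simplified relative to the real Robots class.
public class RobotsDecisionSketch {
    static boolean isPathAllowed(String robotsTxt, String userAgent, String path) {
        List<String> disallows = new ArrayList<>();
        boolean inMatchingGroup = false;
        boolean matched = false;
        for (String rawLine : robotsTxt.split("\n")) {
            String line = rawLine.trim();
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash).trim();  // strip comments
            if (line.isEmpty()) continue;
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inMatchingGroup = value.equals("*")
                    || userAgent.toLowerCase().contains(value.toLowerCase());
                if (inMatchingGroup) matched = true;
            } else if (field.equals("disallow") && inMatchingGroup) {
                if (!value.isEmpty()) disallows.add(value);
            }
        }
        if (!matched) return true;  // no applicable record: fetch is allowed
        for (String spec : disallows)
            if (path.startsWith(spec)) return false;
        return true;
    }
}
```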

makeReadable

protected static java.lang.String makeReadable(java.lang.String inputString)
Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
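A minimal sketch of the guarantee makeReadable must provide: the returned string contains no NUL characters, since PostgreSQL rejects them in text values. The exact substitution the real method uses is not documented here; this hypothetical version simply replaces NULs with spaces.

```java
// Hedged sketch: strip NUL characters, which PostgreSQL does not accept.
public class MakeReadableSketch {
    static String makeReadable(String inputString) {
        StringBuilder sb = new StringBuilder(inputString.length());
        for (int i = 0; i < inputString.length(); i++) {
            char c = inputString.charAt(i);
            sb.append(c == '\0' ? ' ' : c);  // replace each NUL with a space
        }
        return sb.toString();
    }
}
```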


doesPathMatch

protected static boolean doesPathMatch(java.lang.String path,
                                       java.lang.String spec)
Check whether a path matches a specification


doesPathMatch

protected static boolean doesPathMatch(java.lang.String path,
                                       int pathIndex,
                                       java.lang.String spec,
                                       int specIndex)
Recursive method for matching specification to path.
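A standalone sketch of a recursive matcher in the spirit of the two doesPathMatch methods above. Whether the real method supports '*' wildcards is an assumption made here for illustration; plain characters must match exactly, the specification matches as a prefix of the path, and '*' matches any run of characters, which is what forces the recursion.

```java
// Hypothetical recursive path matcher: spec matches as a prefix of path,
// with '*' (assumed) matching zero or more characters.
public class PathMatchSketch {
    // Convenience entry point, mirroring the two-argument overload.
    static boolean doesPathMatch(String path, String spec) {
        return doesPathMatch(path, 0, spec, 0);
    }

    // Recursive worker, mirroring the four-argument overload.
    static boolean doesPathMatch(String path, int pathIndex, String spec, int specIndex) {
        if (specIndex == spec.length())
            return true;  // spec exhausted: it matched as a prefix
        char sc = spec.charAt(specIndex);
        if (sc == '*') {
            // Try consuming zero or more path characters for the wildcard.
            for (int i = pathIndex; i <= path.length(); i++)
                if (doesPathMatch(path, i, spec, specIndex + 1))
                    return true;
            return false;
        }
        if (pathIndex == path.length() || path.charAt(pathIndex) != sc)
            return false;  // literal character mismatch
        return doesPathMatch(path, pathIndex + 1, spec, specIndex + 1);
    }
}
```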