org.apache.manifoldcf.crawler.connectors.rss
Class Robots.Host

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.rss.Robots.Host
Enclosing class:
Robots

protected class Robots.Host
extends java.lang.Object

This class maintains status for a given host. There's an instance of this class for each host in the robots cache.


Field Summary
protected  int checkingRobots
          This will be set to nonzero if the robots structure is currently in use.
protected  java.lang.String hostName
          Host name
protected  long invalidTime
          Timestamp.
protected  boolean isValid
          This flag describes whether or not the host record is valid yet.
protected  int port
          Port
protected  java.lang.String protocol
          Protocol
protected  boolean readingRobots
          This will be set to "true" if the robots.txt for this host is in the process of being read.
protected  java.util.ArrayList records
          This is the list of robots records for the host, or null if no robots.txt was found.
 
Constructor Summary
Robots.Host(java.lang.String protocol, int port, java.lang.String hostName)
          Constructor.
 
Method Summary
 boolean canBeFlushed(long currentTime)
          Check if the current record can be flushed.
 boolean isFetchAllowed(long currentTime, java.lang.String pathString, java.lang.String userAgent, java.lang.String from, double minimumMillisecondsPerBytePerServer, int maxOpenConnectionsPerServer, long minimumMillisecondsPerFetchPerServer, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities, int connectionLimit)
          Check a given path string against this host's robots file.
protected  void makeValid(long currentTime, java.lang.String userAgent, java.lang.String from, double minimumMillisecondsPerBytePerServer, int maxOpenConnectionsPerServer, long minimumMillisecondsPerFetchPerServer, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities, int connectionLimit)
          Initialize the record.
protected  void parseRobotsTxt(java.io.BufferedReader r, java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities)
          Parse the robots.txt file using a reader.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

protocol

protected java.lang.String protocol
Protocol


port

protected int port
Port


hostName

protected java.lang.String hostName
Host name


invalidTime

protected long invalidTime
Timestamp. This is the time that the cache record becomes invalid.


isValid

protected boolean isValid
This flag describes whether or not the host record is valid yet.


records

protected java.util.ArrayList records
This is the list of robots records for the host, or null if no robots.txt was found.


readingRobots

protected boolean readingRobots
This will be set to "true" if the robots.txt for this host is in the process of being read.


checkingRobots

protected int checkingRobots
This will be set to nonzero if the robots structure is currently in use.

Constructor Detail

Robots.Host

public Robots.Host(java.lang.String protocol,
                   int port,
                   java.lang.String hostName)
Constructor.

Method Detail

isFetchAllowed

public boolean isFetchAllowed(long currentTime,
                              java.lang.String pathString,
                              java.lang.String userAgent,
                              java.lang.String from,
                              double minimumMillisecondsPerBytePerServer,
                              int maxOpenConnectionsPerServer,
                              long minimumMillisecondsPerFetchPerServer,
                              java.lang.String proxyHost,
                              int proxyPort,
                              java.lang.String proxyAuthDomain,
                              java.lang.String proxyAuthUsername,
                              java.lang.String proxyAuthPassword,
                              org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
                              int connectionLimit)
                       throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
                              org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check a given path string against this host's robots file.

Parameters:
currentTime - is the current time in milliseconds since epoch.
pathString - is the path string to check.
Returns:
true if crawling is allowed, false otherwise.
Throws:
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
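
As a rough illustration of what checking a path against robots rules can look like, the sketch below applies the common robots.txt convention that a non-empty Disallow value blocks every path starting with that value. The class PathCheckSketch and the method isPathAllowed are hypothetical names introduced here, and the logic is an assumption for illustration only; it ignores Allow lines, user-agent selection, throttling, and everything else the real method may do.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; this is not the ManifoldCF implementation.
final class PathCheckSketch {
    // Returns false when pathString starts with any non-empty Disallow prefix.
    // An empty Disallow value conventionally blocks nothing, so it is skipped.
    static boolean isPathAllowed(String pathString, List<String> disallowPrefixes) {
        for (String prefix : disallowPrefixes) {
            if (!prefix.isEmpty() && pathString.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList("/private", "");
        System.out.println(isPathAllowed("/private/feed.xml", rules)); // prints false
        System.out.println(isPathAllowed("/public/feed.xml", rules));  // prints true
    }
}
```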

canBeFlushed

public boolean canBeFlushed(long currentTime)
Check if the current record can be flushed. This is not quite the same as checking whether the record is valid: a not-yet-valid record still must not be flushed while there is activity going on with that record.
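
The distinction between "valid" and "flushable" can be sketched as follows. HostFlushSketch is a hypothetical, simplified stand-in that reuses the field names documented above; the exact condition is an assumption inferred from the description, not a copy of the actual method body.

```java
// Hypothetical, simplified stand-in for Robots.Host; not the ManifoldCF source.
final class HostFlushSketch {
    long invalidTime;       // time at which the cached record becomes invalid
    boolean isValid;        // whether the record has been initialized yet
    boolean readingRobots;  // true while robots.txt is being read
    int checkingRobots;     // nonzero while the robots structure is in use

    // Assumed rule: a record with ongoing activity is never flushed; an idle
    // record is flushed if it was never made valid or has expired.
    boolean canBeFlushed(long currentTime) {
        if (readingRobots || checkingRobots != 0) {
            return false;
        }
        return !isValid || currentTime >= invalidTime;
    }

    public static void main(String[] args) {
        HostFlushSketch host = new HostFlushSketch();
        host.readingRobots = true;                    // not yet valid, but busy
        System.out.println(host.canBeFlushed(0L));    // prints false
        host.readingRobots = false;
        System.out.println(host.canBeFlushed(0L));    // prints true
    }
}
```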


makeValid

protected void makeValid(long currentTime,
                         java.lang.String userAgent,
                         java.lang.String from,
                         double minimumMillisecondsPerBytePerServer,
                         int maxOpenConnectionsPerServer,
                         long minimumMillisecondsPerFetchPerServer,
                         java.lang.String proxyHost,
                         int proxyPort,
                         java.lang.String proxyAuthDomain,
                         java.lang.String proxyAuthUsername,
                         java.lang.String proxyAuthPassword,
                         java.lang.String hostName,
                         org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
                         int connectionLimit)
                  throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
                         org.apache.manifoldcf.core.interfaces.ManifoldCFException
Initialize the record. This method reads the robots file on the specified protocol/host/port, and parses it according to the rules.

Throws:
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
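
For orientation, the location of a robots file for a given protocol/host/port can be derived as in the sketch below. RobotsUrlSketch and robotsUrl are hypothetical names, and the fixed path /robots.txt is the usual convention rather than something this page states; how the connector actually builds and fetches the URL (proxy settings, throttling, connection limits) is not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; not part of the Robots.Host API.
final class RobotsUrlSketch {
    // Builds protocol://hostName[:port]/robots.txt. A port of -1 is omitted.
    static String robotsUrl(String protocol, String hostName, int port) {
        try {
            return new URI(protocol, null, hostName, port, "/robots.txt", null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build robots URL", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(robotsUrl("http", "example.com", 8080)); // http://example.com:8080/robots.txt
        System.out.println(robotsUrl("https", "example.com", -1));  // https://example.com/robots.txt
    }
}
```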

parseRobotsTxt

protected void parseRobotsTxt(java.io.BufferedReader r,
                              java.lang.String hostName,
                              org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities)
                       throws java.io.IOException,
                              org.apache.manifoldcf.core.interfaces.ManifoldCFException
Parse the robots.txt file using a reader. This method is not expected to close the stream.

Throws:
java.io.IOException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
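
Because the parser does not close the stream, the code that opens the reader is the natural place to close it. The sketch below shows that ownership pattern with a deliberately tiny parser. ParseOwnershipSketch, parseDisallows, and parseString are hypothetical names, and the parsing rules (strip # comments, collect Disallow values) are a simplification assumed for illustration, not the rules the real method applies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of reader ownership; not the ManifoldCF parser.
final class ParseOwnershipSketch {
    // Reads every line from r and collects Disallow values. Does not close r.
    static List<String> parseDisallows(BufferedReader r) throws IOException {
        List<String> values = new ArrayList<>();
        String line;
        while ((line = r.readLine()) != null) {
            int hash = line.indexOf('#');            // drop trailing comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                            // not a "field: value" line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("Disallow")) {
                values.add(value);
            }
        }
        return values;
    }

    // The caller opens the reader and closes it via try-with-resources.
    static List<String> parseString(String robotsTxt) {
        try (BufferedReader r = new BufferedReader(new StringReader(robotsTxt))) {
            return parseDisallows(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "User-agent: *\nDisallow: /private # internal\nDisallow:\n";
        System.out.println(parseString(text).size()); // prints 2
    }
}
```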