org.apache.manifoldcf.crawler.connectors.webcrawler
Class RobotsManager

java.lang.Object
  extended by org.apache.manifoldcf.core.database.BaseTable
      extended by org.apache.manifoldcf.crawler.connectors.webcrawler.RobotsManager

public class RobotsManager
extends org.apache.manifoldcf.core.database.BaseTable

This class manages the database table that stores robots.txt files for hosts. The data resides in the database and, up to a size limit, in an in-memory cache. The result is a memory-limited, database-backed repository of robots files that the crawler can draw on.
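
The idea of a memory-limited cache in front of a persistent table can be sketched as follows. This is a minimal illustration, not the ManifoldCF implementation: the class BoundedRobotsStore, its methods, and the use of a HashMap as a stand-in for the database table are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the ManifoldCF implementation.
class BoundedRobotsStore {
  // Stand-in for the database table (assumption: host name -> robots.txt text).
  private final Map<String, String> table = new HashMap<>();
  // In-memory cache that holds at most maxEntries hosts.
  private final Map<String, String> cache;

  BoundedRobotsStore(final int maxEntries) {
    // accessOrder=true keeps entries ordered from least to most recently used.
    this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > maxEntries;
      }
    };
  }

  // Write to the backing table and refresh the cached copy.
  void write(String host, String robots) {
    table.put(host, robots);
    cache.put(host, robots);
  }

  // Read from the cache first; fall back to the table and repopulate the cache.
  String read(String host) {
    String cached = cache.get(host);
    if (cached != null) {
      return cached;
    }
    String stored = table.get(host);
    if (stored != null) {
      cache.put(host, stored);
    }
    return stored;
  }

  // Number of entries currently held in memory.
  int cachedCount() {
    return cache.size();
  }
}
```

With a limit of two entries, writing three hosts leaves only two in memory, while all three remain readable through the backing table.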


Nested Class Summary
protected static class RobotsManager.HostDescription
          This is the object description for a robots host object.
protected static class RobotsManager.HostExecutor
          This is the executor object for locating robots host objects.
protected static class RobotsManager.Record
          This class represents a record in a robots.txt file.
protected static class RobotsManager.RobotsCacheClass
          Cache class for robots.
protected static class RobotsManager.RobotsData
          This is a cached data item.
 
Field Summary
static java.lang.String _rcsid
           
protected static java.lang.String expirationField
           
protected static java.lang.String hostField
           
protected static RobotsManager.RobotsCacheClass robotsCacheClass
           
protected static java.lang.String robotsField
           
 
Fields inherited from class org.apache.manifoldcf.core.database.BaseTable
dbInterface, tableName
 
Constructor Summary
RobotsManager(org.apache.manifoldcf.core.interfaces.IThreadContext tc, org.apache.manifoldcf.core.interfaces.IDBInterface database)
          Constructor.
 
Method Summary
 java.lang.Boolean checkFetchAllowed(java.lang.String userAgent, java.lang.String hostName, long currentTime, java.lang.String pathString, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities)
          Check whether a fetch is allowed, using robots.txt data read from the cache or from the database.
 void deinstall()
          Uninstall the manager.
protected static boolean doesPathMatch(java.lang.String path, int pathIndex, java.lang.String spec, int specIndex)
          Recursive method for matching specification to path.
protected static boolean doesPathMatch(java.lang.String path, java.lang.String spec)
          Check if path matches specification.
protected static java.lang.String getRobotsKey(java.lang.String hostName)
          Construct a key which represents an individual host name.
 void install()
          Install the manager.
protected static java.lang.String makeReadable(java.lang.String inputString)
          Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
protected  RobotsManager.RobotsData readRobotsData(java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities)
          Read robots data, if it exists.
 void writeRobotsData(java.lang.String hostName, long expirationTime, java.io.InputStream data)
          Write robots.txt, replacing any existing row.
 
Methods inherited from class org.apache.manifoldcf.core.database.BaseTable
addTableIndex, analyzeTable, beginTransaction, constructDistinctOnClause, constructOffsetLimitClause, constructRegexpClause, constructSubstringClause, endTransaction, getDatabaseCacheKey, getDBInterface, getMaxInClause, getMaxOrClause, getTableIndexes, getTableName, getTableSchema, getTransactionID, makeTableKey, noteModifications, performAddIndex, performAlter, performCreate, performDelete, performDrop, performInsert, performLock, performModification, performQuery, performQuery, performRemoveIndex, performUpdate, prepareRowForSave, readRow, reindexTable, signalRollback
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values

robotsCacheClass

protected static RobotsManager.RobotsCacheClass robotsCacheClass

hostField

protected static final java.lang.String hostField
See Also:
Constant Field Values

robotsField

protected static final java.lang.String robotsField
See Also:
Constant Field Values

expirationField

protected static final java.lang.String expirationField
See Also:
Constant Field Values

Constructor Detail

RobotsManager

public RobotsManager(org.apache.manifoldcf.core.interfaces.IThreadContext tc,
                     org.apache.manifoldcf.core.interfaces.IDBInterface database)
              throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Constructor. Note that a RobotsManager handle is only useful within a specific thread context, so the calling connector logic must recreate the handle whenever the thread context changes.

Parameters:
tc - is the thread context.
database - is the database handle.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

Method Detail

install

public void install()
             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Install the manager.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

deinstall

public void deinstall()
               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Uninstall the manager.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

checkFetchAllowed

public java.lang.Boolean checkFetchAllowed(java.lang.String userAgent,
                                           java.lang.String hostName,
                                           long currentTime,
                                           java.lang.String pathString,
                                           org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities)
                                    throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check whether a fetch is allowed, using robots.txt data read from the cache or from the database.

Parameters:
hostName - is the host for which the data is desired.
currentTime - is the time of the check.
Returns:
null if the robots record needs to be fetched, true if fetch is allowed, false if it is not.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
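
Because the return type is java.lang.Boolean rather than boolean, a caller has to handle null before unboxing. The sketch below shows one way to branch on the result. The class FetchDecisionDemo and its Action enum are hypothetical names introduced for the example, and treating false as "fetch disallowed" is an assumption, since the description above only spells out the null and true cases.

```java
// Hypothetical caller-side handling; the class and enum names are illustrative.
class FetchDecisionDemo {
  enum Action { FETCH_ROBOTS_FIRST, FETCH_DOCUMENT, SKIP_DOCUMENT }

  // Map the three possible Boolean results onto an action.
  static Action decide(Boolean fetchAllowed) {
    if (fetchAllowed == null) {
      // No usable robots record yet: it has to be fetched before deciding.
      return Action.FETCH_ROBOTS_FIRST;
    }
    // Assumption: false means the robots data disallows the fetch.
    return fetchAllowed.booleanValue() ? Action.FETCH_DOCUMENT : Action.SKIP_DOCUMENT;
  }
}
```

Checking for null first avoids the NullPointerException that automatic unboxing would raise in an expression such as if (result).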

writeRobotsData

public void writeRobotsData(java.lang.String hostName,
                            long expirationTime,
                            java.io.InputStream data)
                     throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                            java.io.IOException
Write robots.txt, replacing any existing row.

Parameters:
hostName - is the host.
expirationTime - is the time this data should expire.
data - is the robots data stream. May be null.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
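
The replace-on-write behavior and the nullable data stream can be sketched as follows. This is an illustration under stated assumptions, not the actual method body: RobotsRowWriter, its Row type, the in-memory map standing in for the table, and the UTF-8 decoding are all choices made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; a map stands in for the database table.
class RobotsRowWriter {
  static final class Row {
    final long expirationTime;
    final String robots; // null when no data stream was supplied

    Row(long expirationTime, String robots) {
      this.expirationTime = expirationTime;
      this.robots = robots;
    }
  }

  private final Map<String, Row> rows = new HashMap<>();

  // Store the row for hostName, replacing any row already present.
  void write(String hostName, long expirationTime, InputStream data) throws IOException {
    String text = null;
    if (data != null) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = data.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
      }
      // Assumption: the stream is decoded as UTF-8.
      text = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
    rows.put(hostName, new Row(expirationTime, text));
  }

  Row get(String hostName) {
    return rows.get(hostName);
  }

  int rowCount() {
    return rows.size();
  }
}
```

Writing the same host twice leaves a single row that carries the later expiration time and data.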

getRobotsKey

protected static java.lang.String getRobotsKey(java.lang.String hostName)
Construct a key which represents an individual host name.

Parameters:
hostName - is the name of the host.
Returns:
the cache key.

readRobotsData

protected RobotsManager.RobotsData readRobotsData(java.lang.String hostName,
                                                  org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities)
                                           throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Read robots data, if it exists.

Returns:
null if no robots data exists for the host; otherwise the robots data.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

makeReadable

protected static java.lang.String makeReadable(java.lang.String inputString)
Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
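
One way to produce NUL-free text is to replace each NUL with a visible placeholder, as sketched below. The class ReadableText, the method name replaceNul, and the choice of "^@" as the placeholder are assumptions made for illustration; the description above does not say how the real method encodes such characters.

```java
// Hypothetical illustration only; the real encoding may differ.
class ReadableText {
  // Replace every NUL character with the two-character placeholder "^@".
  static String replaceNul(String input) {
    StringBuilder sb = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      if (c == 0) {
        sb.append("^@");
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```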


doesPathMatch

protected static boolean doesPathMatch(java.lang.String path,
                                       java.lang.String spec)
Check if path matches specification.


doesPathMatch

protected static boolean doesPathMatch(java.lang.String path,
                                       int pathIndex,
                                       java.lang.String spec,
                                       int specIndex)
Recursive method for matching specification to path.
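
A recursive matcher with this four-argument shape could look like the sketch below. It assumes the common robots.txt conventions, which this page does not confirm for RobotsManager: the specification is matched as a prefix of the path, '*' matches zero or more characters, and a trailing '$' anchors the match to the end of the path. The class name RobotsPathMatcher is hypothetical.

```java
// Hypothetical illustration only; the wildcard rules are assumed, not documented here.
class RobotsPathMatcher {
  // Entry point: match from the start of both strings.
  static boolean matches(String path, String spec) {
    return matches(path, 0, spec, 0);
  }

  // Match spec[specIndex..] against path[pathIndex..].
  static boolean matches(String path, int pathIndex, String spec, int specIndex) {
    while (true) {
      if (specIndex == spec.length()) {
        // Specification exhausted: the path matches as a prefix.
        return true;
      }
      char specChar = spec.charAt(specIndex++);
      if (specChar == '*') {
        // Collapse runs of '*' so repeated wildcards do not multiply the work.
        while (specIndex < spec.length() && spec.charAt(specIndex) == '*') {
          specIndex++;
        }
        // The wildcard may absorb zero or more characters: try each split point.
        for (int i = pathIndex; i <= path.length(); i++) {
          if (matches(path, i, spec, specIndex)) {
            return true;
          }
        }
        return false;
      }
      if (specChar == '$' && specIndex == spec.length()) {
        // A trailing '$' requires the path to end here.
        return pathIndex == path.length();
      }
      if (pathIndex == path.length() || path.charAt(pathIndex) != specChar) {
        return false;
      }
      pathIndex++;
    }
  }
}
```

Under these assumptions, the specification /private matches /private/page.html, and /*.pdf$ matches /a/b/file.pdf but not /a/file.pdf?x=1.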