org.apache.manifoldcf.crawler.connectors.webcrawler
Class WebcrawlerConnector

java.lang.Object
  extended by org.apache.manifoldcf.core.connector.BaseConnector
      extended by org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
          extended by org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector
All Implemented Interfaces:
org.apache.manifoldcf.core.interfaces.IConnector, org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector

public class WebcrawlerConnector
extends org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector

This is the Web Crawler implementation of the IRepositoryConnector interface. This connector may be superseded by one that calls out to Python, or by an entirely Python Connector Framework, depending on how the winds blow.


Nested Class Summary
protected static class WebcrawlerConnector.CanonicalizationPolicies
          Class representing a list of canonicalization rules
protected static class WebcrawlerConnector.CanonicalizationPolicy
          Class representing a URL regular expression match, for the purposes of determining canonicalization policy
protected static class WebcrawlerConnector.DocumentURLFilter
          This class describes the url filtering information obtained from a digested DocumentSpecification.
protected  class WebcrawlerConnector.FeedContextClass
           
protected  class WebcrawlerConnector.FeedItemContextClass
           
protected  class WebcrawlerConnector.FindHandler
          This class is used to discover links in a session login context
protected  class WebcrawlerConnector.FindHTMLFormHandler
          This class is the handler for HTML form parsing during state transitions
protected  class WebcrawlerConnector.FindHTMLHrefHandler
          This class is the handler for HTML parsing during state transitions
protected  class WebcrawlerConnector.FindPreferredRedirectionHandler
          This class is the handler for redirection handling during state transitions
protected  class WebcrawlerConnector.FindRedirectionHandler
          This class is the handler for redirection parsing during state transitions
protected static class WebcrawlerConnector.NameValue
          Name/value class
protected  class WebcrawlerConnector.OuterContextClass
          This class handles the outermost XML context for the feed document.
protected  class WebcrawlerConnector.ProcessActivityHTMLHandler
          Class that describes HTML handling
protected  class WebcrawlerConnector.ProcessActivityLinkHandler
          This class is the handler for links that get added into an IProcessActivity object.
protected  class WebcrawlerConnector.ProcessActivityRedirectionHandler
          Class that describes redirection handling
protected  class WebcrawlerConnector.ProcessActivityXMLHandler
          Class that describes XML handling
protected  class WebcrawlerConnector.RDFContextClass
           
protected  class WebcrawlerConnector.RDFItemContextClass
           
protected  class WebcrawlerConnector.RSSChannelContextClass
           
protected  class WebcrawlerConnector.RSSContextClass
           
protected  class WebcrawlerConnector.RSSItemContextClass
           
 
Field Summary
static java.lang.String _rcsid
           
static java.lang.String ACTIVITY_FETCH
           
static java.lang.String ACTIVITY_LOGON_END
           
static java.lang.String ACTIVITY_LOGON_START
           
static java.lang.String ACTIVITY_ROBOTSPARSE
           
protected static DataCache cache
          This is where we keep data around between the getVersions() phase and the processDocuments() phase.
protected  int connectionTimeoutMilliseconds
          Connection timeout, milliseconds.
protected  CookieManager cookieManager
          The cookie manager used by this instance
protected  CredentialsDescription credentialsDescription
          The credentials description
protected  DNSManager dnsManager
          The DNS manager currently used by this instance
protected static java.lang.String FETCH_LOGIN
           
protected static java.lang.String FETCH_ROBOTS
           
protected static java.lang.String FETCH_STANDARD
           
protected  java.lang.String from
          The email address for this connector instance
protected static java.lang.String[] interestingMimeTypeArray
          This represents a list of the mime types that this connector knows how to extract links from.
protected static java.util.Map interestingMimeTypeMap
           
protected  boolean isInitialized
          This flag is set when the instance has been initialized
static java.lang.String REL_LINK
           
static java.lang.String REL_REDIRECT
           
protected static int RESULT_NO_DOCUMENT
           
protected static int RESULT_NO_VERSION
           
protected static int RESULT_RETRY_DOCUMENT
           
protected static int RESULT_VERSION_NEEDED
           
protected static int RESULTSTATUS_FALSE
           
protected static int RESULTSTATUS_NOTYETDETERMINED
           
protected static int RESULTSTATUS_TRUE
           
protected static int ROBOTS_ALL
           
protected static int ROBOTS_DATA
           
protected static int ROBOTS_NONE
           
protected  RobotsManager robotsManager
          The robots manager currently used by this instance
protected  int robotsUsage
          Robots usage flag
protected static int SESSIONSTATE_LOGIN
          We're in 'login mode'
protected static int SESSIONSTATE_NORMAL
          Normal fetch of content document.
protected  int socketTimeoutMilliseconds
          Socket timeout, milliseconds
protected  ThrottleDescription throttleDescription
          The throttle description
protected  TrustsDescription trustsDescription
          The trusts description
protected static java.util.Map understoodProtocols
           
protected  java.lang.String userAgent
          The user-agent for this connector instance
 
Fields inherited from class org.apache.manifoldcf.core.connector.BaseConnector
currentContext, params
 
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_PARTIAL
 
Constructor Summary
WebcrawlerConnector()
          Constructor.
 
Method Summary
 void addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, long startTime, long endTime)
          Queue "seed" documents.
protected  java.lang.String[] calculateDocumentEvents(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String documentIdentifier)
          Calculate events that should be associated with a document.
 java.lang.String check()
          Check status of connection.
protected  int checkFetchAllowed(java.lang.String documentIdentifier, java.lang.String protocol, java.lang.String hostIPAddress, int port, PageCredentials credential, org.apache.manifoldcf.core.interfaces.IKeystoreManager trustStore, java.lang.String hostName, java.lang.String[] binNames, long currentTime, java.lang.String pathString, org.apache.manifoldcf.crawler.interfaces.IVersionActivity versionActivities, int connectionLimit)
          Check robots to see if fetch is allowed.
 void clearThreadContext()
          Clear out any state information specific to a given thread.
protected static void compileList(java.util.ArrayList output, java.util.ArrayList input)
          Compile all regexp entries in the passed in list, and add them to the output list.
 void deinstall(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
          Uninstall the connector.
 void disconnect()
          Close the connection.
protected  java.lang.String doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter, java.net.URI url)
          Code to canonicalize a URL.
protected  boolean extractLinks(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, WebcrawlerConnector.DocumentURLFilter filter)
          Code to extract links from an already-fetched document.
protected  FormData findHTMLForm(java.lang.String currentURI, LoginParameters lp)
          Find matching HTML form data, if present.
protected  java.lang.String findHTMLLinkURI(java.lang.String currentURI, LoginParameters lp)
          Find HTML link URI, if present, making sure specified preference is matched.
protected static java.util.ArrayList findMetadata(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
          Read a document specification to yield a map of name/value pairs for metadata
protected  java.lang.String findPreferredRedirectionURI(java.lang.String currentURI, LoginParameters lp)
          Find a preferred redirection URI, if it exists
protected  java.lang.String findRedirectionURI(java.lang.String currentURI)
          Find a redirection URI, if it exists
protected static java.lang.String[] getAcls(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
          Grab forced acl out of document specification.
 java.lang.String[] getActivitiesList()
          Return the list of activities that this connector supports (i.e. writes into the log).
 java.lang.String[] getBinNames(java.lang.String documentIdentifier)
          Get the bin name string for a document identifier.
 int getConnectorModel()
          Tell the world what model this connector uses for getDocumentIdentifiers().
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] oldVersions, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, int jobMode, boolean usesDefaultAuthority)
          Get document versions given an array of document identifiers.
 java.lang.String getJSPFolder()
          Return the path for the UI JSP elements.
 int getMaxDocumentRequest()
          Get the maximum number of documents to amalgamate together into one batch, for this connector.
protected  PageCredentials getPageCredential(java.lang.String documentIdentifier)
          Get the page credentials for a given document identifier (URL)
 java.lang.String[] getRelationshipTypes()
          Return the list of relationship types that this connector recognizes.
protected  SequenceCredentials getSequenceCredential(java.lang.String documentIdentifier)
          Get the sequence credentials for a given document identifier (URL)
protected  void getSession()
          Start a session
protected  org.apache.manifoldcf.core.interfaces.IKeystoreManager getTrustStore(java.lang.String documentIdentifier)
          Get the trust store for a given document identifier (URL)
protected  void handleHTML(java.lang.String documentURI, IHTMLHandler handler)
          Handle document references from HTML
protected  void handleRedirects(java.lang.String documentURI, IRedirectionHandler handler)
          Handle extracting the redirect link from a redirect response.
protected  void handleXML(java.lang.String documentURI, IXMLHandler handler)
          Handle document references from XML.
 void install(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
          Install the connector.
protected  boolean isContentInteresting(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities, java.lang.String documentIdentifier, int response, java.lang.String contentType)
          Code to check if data is interesting, based on response code and content type.
protected  boolean isDataIngestable(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities, java.lang.String documentIdentifier)
          Code to check if an already-fetched document should be ingested.
protected  boolean isDocumentText(java.lang.String documentURI)
          Is the document text, as far as we can tell?
protected static boolean isStrange(byte x)
          Check if character is not typical ASCII.
protected static boolean isText(byte[] beginChunk, int chunkLength)
          Test to see if a document is text or not.
protected static boolean isWhiteSpace(byte x)
          Check if a byte is a whitespace character.
protected  int lookupIPAddress(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities, java.lang.String hostName, long currentTime, java.lang.StringBuffer ipAddressBuffer)
          Look up an IP address given a non-canonical host name.
protected  java.lang.String makeDNSEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String hostNameKey)
          Calculate the event name for DNS access.
protected  java.lang.String makeDocumentIdentifier(java.lang.String parentIdentifier, java.lang.String rawURL, WebcrawlerConnector.DocumentURLFilter filter)
          Convert an absolute or relative URL to a document identifier.
protected  java.lang.String makeRobotsEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity versionActivities, java.lang.String robotsKey)
          Construct a name for the global web-connector robots event.
protected static java.lang.String makeRobotsKey(java.lang.String protocol, java.lang.String hostName, int port)
          Construct the robots key for a host.
protected  java.lang.String makeSessionLoginEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities, java.lang.String sequenceKey)
          Calculate the event name for session login.
 void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName)
          Output the configuration body section.
 void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.ArrayList tabsArray)
          Output the configuration header section.
 void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds, java.lang.String tabName)
          Output the specification body section.
 void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds, java.util.ArrayList tabsArray)
          Output the specification header section.
protected static void pack(java.lang.StringBuffer output, java.lang.String value, char delimiter)
          Stuffer for packing a single string with an end delimiter
protected static void packFixedList(java.lang.StringBuffer output, java.lang.String[] values, char delimiter)
          Stuffer for packing lists of fixed length
protected static void packList(java.lang.StringBuffer output, java.util.ArrayList values, char delimiter)
          Stuffer for packing lists of variable length
protected static void packList(java.lang.StringBuffer output, java.lang.String[] values, char delimiter)
          Another stuffer for packing lists of variable length
 void poll()
          This method is periodically called for all connectors that are connected but not in active use.
 java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
          Process a configuration post.
 void processDocuments(java.lang.String[] documentIdentifiers, java.lang.String[] versions, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, boolean[] scanOnly)
          Process a set of documents.
 java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
          Process a specification post.
 void releaseDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] versions)
          Free a set of documents.
protected static java.util.ArrayList stringToArray(java.lang.String input)
          Read a string as a sequence of individual expressions, urls, etc.
protected static int unpack(java.lang.StringBuffer sb, java.lang.String value, int startPosition, char delimiter)
          Unstuffer for a single string packed by pack().
protected static int unpackFixedList(java.lang.String[] output, java.lang.String value, int startPosition, char delimiter)
          Unstuffer for unpacking lists of fixed length
protected static int unpackList(java.util.ArrayList output, java.lang.String value, int startPosition, char delimiter)
          Unstuffer for unpacking lists of variable length.
 void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
          View configuration.
 void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
          View specification.
 
Methods inherited from class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
addSeedDocuments, getDocumentIdentifiers, getDocumentIdentifiers, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getRemainingDocumentIdentifiers, processDocuments, requestInfo
 
Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector
connect, getConfiguration, setThreadContext
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.manifoldcf.core.interfaces.IConnector
connect, getConfiguration, setThreadContext
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values

RESULTSTATUS_FALSE

protected static final int RESULTSTATUS_FALSE
See Also:
Constant Field Values

RESULTSTATUS_TRUE

protected static final int RESULTSTATUS_TRUE
See Also:
Constant Field Values

RESULTSTATUS_NOTYETDETERMINED

protected static final int RESULTSTATUS_NOTYETDETERMINED
See Also:
Constant Field Values

interestingMimeTypeArray

protected static final java.lang.String[] interestingMimeTypeArray
This represents a list of the mime types that this connector knows how to extract links from. Documents that are indexable are described by the output connector.


interestingMimeTypeMap

protected static final java.util.Map interestingMimeTypeMap
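
The relationship between the array and the map is not spelled out on this page. A minimal sketch, assuming the map is simply a lookup built from the array: the class name, the listed mime types, and the isInteresting helper are all hypothetical and are not taken from WebcrawlerConnector.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not the connector's actual code.
final class MimeTypeLookupSketch {
  // Assumed values: this page does not list the real array contents.
  static final String[] INTERESTING = {
    "text/html", "application/rss+xml", "application/xml"
  };

  // Lookup table built once from the array, keyed by mime type.
  static final Map<String, String> LOOKUP = new HashMap<>();
  static {
    for (String type : INTERESTING) {
      LOOKUP.put(type, type);
    }
  }

  /** True if the bare mime type of a Content-Type header value is in the lookup table. */
  static boolean isInteresting(String contentType) {
    if (contentType == null) {
      return false;
    }
    // Drop parameters such as "; charset=UTF-8" before the lookup.
    int semi = contentType.indexOf(';');
    String bare = (semi >= 0 ? contentType.substring(0, semi) : contentType)
        .trim()
        .toLowerCase(Locale.ROOT);
    return LOOKUP.containsKey(bare);
  }

  public static void main(String[] args) {
    System.out.println(isInteresting("text/html; charset=UTF-8"));
    System.out.println(isInteresting("image/png"));
  }
}
```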

understoodProtocols

protected static final java.util.Map understoodProtocols

ROBOTS_NONE

protected static final int ROBOTS_NONE
See Also:
Constant Field Values

ROBOTS_DATA

protected static final int ROBOTS_DATA
See Also:
Constant Field Values

ROBOTS_ALL

protected static final int ROBOTS_ALL
See Also:
Constant Field Values

REL_LINK

public static final java.lang.String REL_LINK
See Also:
Constant Field Values

REL_REDIRECT

public static final java.lang.String REL_REDIRECT
See Also:
Constant Field Values

ACTIVITY_FETCH

public static final java.lang.String ACTIVITY_FETCH
See Also:
Constant Field Values

ACTIVITY_ROBOTSPARSE

public static final java.lang.String ACTIVITY_ROBOTSPARSE
See Also:
Constant Field Values

ACTIVITY_LOGON_START

public static final java.lang.String ACTIVITY_LOGON_START
See Also:
Constant Field Values

ACTIVITY_LOGON_END

public static final java.lang.String ACTIVITY_LOGON_END
See Also:
Constant Field Values

FETCH_ROBOTS

protected static final java.lang.String FETCH_ROBOTS
See Also:
Constant Field Values

FETCH_STANDARD

protected static final java.lang.String FETCH_STANDARD
See Also:
Constant Field Values

FETCH_LOGIN

protected static final java.lang.String FETCH_LOGIN
See Also:
Constant Field Values

robotsUsage

protected int robotsUsage
Robots usage flag


userAgent

protected java.lang.String userAgent
The user-agent for this connector instance


from

protected java.lang.String from
The email address for this connector instance


connectionTimeoutMilliseconds

protected int connectionTimeoutMilliseconds
Connection timeout, milliseconds.


socketTimeoutMilliseconds

protected int socketTimeoutMilliseconds
Socket timeout, milliseconds


throttleDescription

protected ThrottleDescription throttleDescription
The throttle description


credentialsDescription

protected CredentialsDescription credentialsDescription
The credentials description


trustsDescription

protected TrustsDescription trustsDescription
The trusts description


robotsManager

protected RobotsManager robotsManager
The robots manager currently used by this instance


dnsManager

protected DNSManager dnsManager
The DNS manager currently used by this instance


cookieManager

protected CookieManager cookieManager
The cookie manager used by this instance


isInitialized

protected boolean isInitialized
This flag is set when the instance has been initialized


cache

protected static DataCache cache
This is where we keep data around between the getVersions() phase and the processDocuments() phase.


SESSIONSTATE_NORMAL

protected static final int SESSIONSTATE_NORMAL
Normal fetch of content document. (For all we know, we're logged in already).

See Also:
Constant Field Values

SESSIONSTATE_LOGIN

protected static final int SESSIONSTATE_LOGIN
We're in 'login mode'

See Also:
Constant Field Values

RESULT_NO_DOCUMENT

protected static final int RESULT_NO_DOCUMENT
See Also:
Constant Field Values

RESULT_NO_VERSION

protected static final int RESULT_NO_VERSION
See Also:
Constant Field Values

RESULT_VERSION_NEEDED

protected static final int RESULT_VERSION_NEEDED
See Also:
Constant Field Values

RESULT_RETRY_DOCUMENT

protected static final int RESULT_RETRY_DOCUMENT
See Also:
Constant Field Values

Constructor Detail

WebcrawlerConnector

public WebcrawlerConnector()
Constructor.

Method Detail

getConnectorModel

public int getConnectorModel()
Tell the world what model this connector uses for getDocumentIdentifiers(). This must return one of the MODEL_ constants inherited from IRepositoryConnector.

Specified by:
getConnectorModel in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getConnectorModel in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Returns:
the model type value.

getJSPFolder

public java.lang.String getJSPFolder()
Return the path for the UI JSP elements. These JSPs must be provided to allow the connector to be configured, and to permit it to present document filtering specification information in the UI. This method should return the name of the folder, under the /connectors/ area, where the appropriate JSPs can be found. The name should NOT have a slash in it.

Returns:
the folder part

install

public void install(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Install the connector. This method is called to initialize persistent storage for the connector, such as database tables etc. It is called when the connector is registered.

Specified by:
install in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
install in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the current thread context.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

deinstall

public void deinstall(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Uninstall the connector. This method is called to remove persistent storage for the connector, such as database tables etc. It is called when the connector is deregistered.

Specified by:
deinstall in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
deinstall in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the current thread context.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

getActivitiesList

public java.lang.String[] getActivitiesList()
Return the list of activities that this connector supports (i.e. writes into the log).

Specified by:
getActivitiesList in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getActivitiesList in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Returns:
the list.

getRelationshipTypes

public java.lang.String[] getRelationshipTypes()
Return the list of relationship types that this connector recognizes.

Specified by:
getRelationshipTypes in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getRelationshipTypes in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Returns:
the list.

clearThreadContext

public void clearThreadContext()
Clear out any state information specific to a given thread. This method is called when this object is returned to the connection pool.

Specified by:
clearThreadContext in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
clearThreadContext in class org.apache.manifoldcf.core.connector.BaseConnector

getSession

protected void getSession()
                   throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Start a session

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

poll

public void poll()
          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
This method is periodically called for all connectors that are connected but not in active use.

Specified by:
poll in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
poll in class org.apache.manifoldcf.core.connector.BaseConnector
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

check

public java.lang.String check()
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check status of connection.

Specified by:
check in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
check in class org.apache.manifoldcf.core.connector.BaseConnector
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

disconnect

public void disconnect()
                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Close the connection. Call this before discarding the repository connector.

Specified by:
disconnect in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
disconnect in class org.apache.manifoldcf.core.connector.BaseConnector
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

getBinNames

public java.lang.String[] getBinNames(java.lang.String documentIdentifier)
Get the bin name string for a document identifier. The bin name describes the queue to which the document will be assigned for throttling purposes. Throttling controls the rate at which items in a given queue are fetched; it does not say anything about the overall fetch rate, which may operate on multiple queues or bins. For example, if you implement a web crawler, a good choice of bin name would be the server name, since that is likely to correspond to a real resource that will need real throttle protection.

Specified by:
getBinNames in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getBinNames in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifier - is the document identifier.
Returns:
the bin name.
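
The server-name suggestion above can be sketched as follows. This is only an illustration, assuming document identifiers are absolute URLs; BinNameSketch and binNamesFor are hypothetical names, and the empty-string fallback bin is an assumption rather than the connector's documented behavior.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper; not part of WebcrawlerConnector.
final class BinNameSketch {
  /**
   * Returns the lower-cased host of the URL as the single bin name.
   * Falls back to an empty bin name when no host can be parsed (an assumption).
   */
  static String[] binNamesFor(String documentIdentifier) {
    try {
      String host = new URI(documentIdentifier).getHost();
      if (host == null) {
        return new String[] {""};
      }
      return new String[] {host.toLowerCase(Locale.ROOT)};
    } catch (URISyntaxException e) {
      return new String[] {""};
    }
  }

  public static void main(String[] args) {
    System.out.println(binNamesFor("http://www.Example.com/docs/index.html")[0]);
  }
}
```

Keying the bin on the host means that all documents from one server share one throttle queue, which matches the stated goal of protecting a real resource.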

addSeedDocuments

public void addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities,
                             org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                             long startTime,
                             long endTime)
                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                             org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents are seeded when this method calls appropriate methods in the passed-in ISeedingActivity object. This method can choose to find repository changes that happen only during the specified time interval. The seeds recorded by this method will be viewed by the framework based on what the getConnectorModel() method returns. It is not a big problem if the connector chooses to create more seeds than are strictly necessary; it is merely a question of overall work required. The times passed to this method may be interpreted for greatest efficiency. The time ranges any given job uses with this connector will not overlap, but will proceed starting at 0 and going to the "current time", each time the job is run. For continuous crawling jobs, this method will be called once, when the job starts, and at various periodic intervals as the job executes. When a job's specification is changed, the framework automatically resets the seeding start time to 0. The seeding start time may also be set to 0 on each job run, depending on the connector model returned by getConnectorModel(). Note that it is always OK for this method to seed MORE documents rather than fewer.

Overrides:
addSeedDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
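
The time-range contract described for addSeedDocuments (inclusive start, exclusive end, non-overlapping ranges that begin at 0 and are reset to 0 on a specification change) can be modeled with a small sketch. SeedingWindowSketch and its methods are hypothetical and only mimic the described behavior; they are not the framework's scheduling code.

```java
// Hypothetical model of the seeding time ranges; not framework code.
final class SeedingWindowSketch {
  private long nextStart = 0L;

  /** Returns {startTime, endTime} for the next seeding pass. */
  long[] nextWindow(long currentTime) {
    long[] window = {nextStart, currentTime};
    // The next pass starts where this one ended, so ranges never overlap.
    nextStart = currentTime;
    return window;
  }

  /** Models the reset to 0 described for a job specification change. */
  void reset() {
    nextStart = 0L;
  }

  /** Start is inclusive, end is exclusive. */
  static boolean inWindow(long timestamp, long startTime, long endTime) {
    return timestamp >= startTime && timestamp < endTime;
  }

  public static void main(String[] args) {
    SeedingWindowSketch sketch = new SeedingWindowSketch();
    long[] first = sketch.nextWindow(100L);
    long[] second = sketch.nextWindow(250L);
    System.out.println(first[0] + ".." + first[1]);
    System.out.println(second[0] + ".." + second[1]);
  }
}
```

Because the end is exclusive, a change stamped exactly at the boundary falls into the later range only, so it is seeded once rather than twice.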

getDocumentVersions

public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                              java.lang.String[] oldVersions,
                                              org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
                                              org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                                              int jobMode,
                                              boolean usesDefaultAuthority)
                                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                              org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here.

Specified by:
getDocumentVersions in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getDocumentVersions in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
oldVersions - is the corresponding array of version strings that have been saved for the document identifiers. A null value indicates that this is a first-time fetch, while an empty string indicates that the previous document had an empty version string.
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
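
The return contract described above (null for a document that no longer exists, an empty string for a document that cannot be versioned and is therefore always processed) can be illustrated with a minimal sketch. The class VersionSketch, the "known" map, and the needsProcessing helper are hypothetical stand-ins and are not part of ManifoldCF; the comparison rule is an assumption drawn from the parameter descriptions, not the framework's actual logic.

```java
import java.util.HashMap;
import java.util.Map;

final class VersionSketch {
    // Hypothetical stand-in for the connector's knowledge of each document:
    // key = document identifier, value = version string ("" when unversionable).
    static String[] versionsFor(String[] documentIdentifiers, Map<String, String> known) {
        String[] rval = new String[documentIdentifiers.length];
        for (int i = 0; i < documentIdentifiers.length; i++) {
            // Map.get returns null for unknown identifiers, which matches the
            // "document no longer exists" convention of the return value.
            rval[i] = known.get(documentIdentifiers[i]);
        }
        return rval;
    }

    // Assumed comparison rule: a null old version means a first-time fetch,
    // an empty new version means "always process".
    static boolean needsProcessing(String oldVersion, String newVersion) {
        if (newVersion == null) return false;   // document is gone
        if (newVersion.isEmpty()) return true;  // no versioning ability
        return !newVersion.equals(oldVersion);  // changed, or first-time fetch
    }

    public static void main(String[] args) {
        Map<String, String> known = new HashMap<>();
        known.put("http://example.com/a", "v1");
        known.put("http://example.com/b", "");
        String[] ids = {"http://example.com/a", "http://example.com/b", "http://example.com/gone"};
        for (String v : versionsFor(ids, known)) {
            System.out.println(v == null ? "(null)" : "'" + v + "'");
        }
    }
}
```

Because this method runs for every document considered, the sketch does nothing beyond a map lookup per identifier.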

processDocuments

public void processDocuments(java.lang.String[] documentIdentifiers,
                             java.lang.String[] versions,
                             org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                             org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                             boolean[] scanOnly)
                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                             org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Process a set of documents. This method should cause each document to be fetched and processed, with the results added to the queue of documents for the current job, entered into the incremental ingestion manager, or both. The document specification allows this class to filter what is done based on the job.

Overrides:
processDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifiers - is the set of document identifiers to process.
activities - is the interface this method should use to queue up new document references and ingest documents.
spec - is the document specification.
scanOnly - is an array corresponding to the document identifiers. It is set to true to indicate when the processing should only find other references, and should not actually call the ingestion methods.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption

releaseDocumentVersions

public void releaseDocumentVersions(java.lang.String[] documentIdentifiers,
                                    java.lang.String[] versions)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Free a set of documents. This method is called for all documents whose versions have been fetched using the getDocumentVersions() method, including those that returned null versions. It may be used to free resources committed during the getDocumentVersions() method. It is guaranteed to be called AFTER any calls to processDocuments() for the documents in question.

Specified by:
releaseDocumentVersions in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
releaseDocumentVersions in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifiers - is the set of document identifiers.
versions - is the corresponding set of version identifiers (individual identifiers may be null).
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

getMaxDocumentRequest

public int getMaxDocumentRequest()
Get the maximum number of documents to amalgamate together into one batch, for this connector.

Specified by:
getMaxDocumentRequest in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getMaxDocumentRequest in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Returns:
the maximum number. 0 indicates "unlimited".

outputConfigurationHeader

public void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                      org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                      org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
                                      java.util.ArrayList tabsArray)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      java.io.IOException
Output the configuration header section. This method is called in the head section of the connector's configuration page. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the configuration editing HTML.

Specified by:
outputConfigurationHeader in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
outputConfigurationHeader in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

outputConfigurationBody

public void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                    org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                    org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
                                    java.lang.String tabName)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                    java.io.IOException
Output the configuration body section. This method is called in the body section of the connector's configuration page. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editconnection".

Specified by:
outputConfigurationBody in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
outputConfigurationBody in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabName - is the current tab name.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

processConfigurationPost

public java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                                 org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
                                                 org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a configuration post. This method is called at the start of the connector's configuration page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the configuration parameters accordingly. The name of the posted form is "editconnection".

Specified by:
processConfigurationPost in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
processConfigurationPost in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
variableContext - is the set of variables available from the post, including binary file post information.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
Returns:
null if all is well, or a string error message if there is an error that should prevent saving of the connection (and cause a redirection to an error page).
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

viewConfiguration

public void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                              org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                              org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                              java.io.IOException
View configuration. This method is called in the body section of the connector's view configuration page. Its purpose is to present the connection information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.

Specified by:
viewConfiguration in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
viewConfiguration in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

outputSpecificationHeader

public void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                      org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
                                      java.util.ArrayList tabsArray)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      java.io.IOException
Output the specification header section. This method is called in the head section of a job page which has selected a repository connection of the current type. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the job editing HTML.

Specified by:
outputSpecificationHeader in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
outputSpecificationHeader in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

outputSpecificationBody

public void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                    org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
                                    java.lang.String tabName)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                    java.io.IOException
Output the specification body section. This method is called in the body section of a job page which has selected a repository connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editjob".

Specified by:
outputSpecificationBody in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
outputSpecificationBody in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabName - is the current tab name.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

processSpecificationPost

public java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
                                                 org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the document specification accordingly. The name of the posted form is "editjob".

Specified by:
processSpecificationPost in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
processSpecificationPost in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
variableContext - contains the post data, including binary file-upload information.
ds - is the current document specification for this job.
Returns:
null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

viewSpecification

public void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                              org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                              java.io.IOException
View specification. This method is called in the body section of a job's view page. Its purpose is to present the document specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.

Specified by:
viewSpecification in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
viewSpecification in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

makeSessionLoginEventName

protected java.lang.String makeSessionLoginEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
                                                     java.lang.String sequenceKey)
Calculate the event name for session login.


makeDNSEventName

protected java.lang.String makeDNSEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
                                            java.lang.String hostNameKey)
Calculate the event name for DNS access.


lookupIPAddress

protected int lookupIPAddress(java.lang.String documentIdentifier,
                              org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
                              java.lang.String hostName,
                              long currentTime,
                              java.lang.StringBuffer ipAddressBuffer)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                              org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Look up an IP address given a non-canonical host name.

Returns:
appropriate status.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption

makeRobotsKey

protected static java.lang.String makeRobotsKey(java.lang.String protocol,
                                                java.lang.String hostName,
                                                int port)
Construct the robots key for a host. This is used to look up robots info in the database, and to form the corresponding event name.


makeRobotsEventName

protected java.lang.String makeRobotsEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity versionActivities,
                                               java.lang.String robotsKey)
Construct a name for the global web-connector robots event.


checkFetchAllowed

protected int checkFetchAllowed(java.lang.String documentIdentifier,
                                java.lang.String protocol,
                                java.lang.String hostIPAddress,
                                int port,
                                PageCredentials credential,
                                org.apache.manifoldcf.core.interfaces.IKeystoreManager trustStore,
                                java.lang.String hostName,
                                java.lang.String[] binNames,
                                long currentTime,
                                java.lang.String pathString,
                                org.apache.manifoldcf.crawler.interfaces.IVersionActivity versionActivities,
                                int connectionLimit)
                         throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Check robots to see if fetch is allowed.

Returns:
the appropriate result status code.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption

makeDocumentIdentifier

protected java.lang.String makeDocumentIdentifier(java.lang.String parentIdentifier,
                                                  java.lang.String rawURL,
                                                  WebcrawlerConnector.DocumentURLFilter filter)
                                           throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Convert an absolute or relative URL to a document identifier. This may involve several steps at some point, but right now it does NOT involve converting the host name to a canonical host name. (Doing so would destroy the ability of virtually hosted sites to do the right thing, since the original host name would be lost.) Thus, we do the conversion to IP address right before we actually fetch the document.

Parameters:
parentIdentifier - the identifier of the document in which the raw url was found, or null if none.
rawURL - the starting, un-normalized, un-canonicalized URL.
filter - the filter object, used to remove unmatching URLs.
Returns:
the canonical URL (the document identifier), or null if the url was illegal.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
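
The conversion described above, resolving a raw URL against its parent while keeping the host name exactly as written, can be sketched with java.net.URI. This is an illustrative approximation only: the class name DocumentIdentifierSketch is hypothetical, the DocumentURLFilter argument is omitted, and the choices to accept only http/https and to drop the fragment are assumptions rather than documented behavior of the connector.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class DocumentIdentifierSketch {
    // Returns a normalized absolute URL, or null if the URL is considered illegal.
    static String makeDocumentIdentifier(String parentIdentifier, String rawURL) {
        try {
            URI url = new URI(rawURL);
            if (parentIdentifier != null) {
                // Relative references are resolved against the containing document.
                url = new URI(parentIdentifier).resolve(url);
            }
            url = url.normalize();
            String scheme = url.getScheme();
            if (scheme == null || url.getHost() == null) return null;
            scheme = scheme.toLowerCase(Locale.ROOT);
            // Assumption: only web protocols yield document identifiers.
            if (!scheme.equals("http") && !scheme.equals("https")) return null;
            // The host name is kept as written; no DNS lookup happens here, so
            // virtually hosted sites keep their original names.
            StringBuilder sb = new StringBuilder();
            sb.append(scheme).append("://").append(url.getRawAuthority());
            sb.append(url.getRawPath());
            if (url.getRawQuery() != null) sb.append('?').append(url.getRawQuery());
            return sb.toString(); // fragment intentionally dropped (assumption)
        } catch (URISyntaxException e) {
            return null; // illegal URL
        }
    }

    public static void main(String[] args) {
        System.out.println(makeDocumentIdentifier(
            "http://example.com/a/b/index.html", "../c/page.html#top"));
    }
}
```

Deferring the host-to-IP conversion until fetch time, as the description notes, is what allows the identifier to remain a plain URL string.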

doCanonicalization

protected java.lang.String doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter,
                                              java.net.URI url)
                                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                              java.net.URISyntaxException
Canonicalize a URL. If the URL cannot be canonicalized (because it is illegal), return null.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.net.URISyntaxException

isContentInteresting

protected boolean isContentInteresting(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities,
                                       java.lang.String documentIdentifier,
                                       int response,
                                       java.lang.String contentType)
                                throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
                                       org.apache.manifoldcf.core.interfaces.ManifoldCFException
Code to check if data is interesting, based on response code and content type.

Throws:
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException

isDataIngestable

protected boolean isDataIngestable(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities,
                                   java.lang.String documentIdentifier)
                            throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
                                   org.apache.manifoldcf.core.interfaces.ManifoldCFException
Code to check if an already-fetched document should be ingested.

Throws:
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException

findRedirectionURI

protected java.lang.String findRedirectionURI(java.lang.String currentURI)
                                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find a redirection URI, if it exists

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

findHTMLForm

protected FormData findHTMLForm(java.lang.String currentURI,
                                LoginParameters lp)
                         throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find matching HTML form data, if present. Return null if not.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

findPreferredRedirectionURI

protected java.lang.String findPreferredRedirectionURI(java.lang.String currentURI,
                                                       LoginParameters lp)
                                                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find a preferred redirection URI, if it exists

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

findHTMLLinkURI

protected java.lang.String findHTMLLinkURI(java.lang.String currentURI,
                                           LoginParameters lp)
                                    throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Find HTML link URI, if present, making sure specified preference is matched.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

extractLinks

protected boolean extractLinks(java.lang.String documentIdentifier,
                               org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                               WebcrawlerConnector.DocumentURLFilter filter)
                        throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                               org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Code to extract links from an already-fetched document.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption

handleRedirects

protected void handleRedirects(java.lang.String documentURI,
                               IRedirectionHandler handler)
                        throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Handle extracting the redirect link from a redirect response.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

handleXML

protected void handleXML(java.lang.String documentURI,
                         IXMLHandler handler)
                  throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                         org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Handle document references from XML. Right now we only understand RSS.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption

handleHTML

protected void handleHTML(java.lang.String documentURI,
                          IHTMLHandler handler)
                   throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Handle document references from HTML

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

isDocumentText

protected boolean isDocumentText(java.lang.String documentURI)
                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Is the document text, as far as we can tell?

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

isText

protected static boolean isText(byte[] beginChunk,
                                int chunkLength)
Test to see if a document is text or not. The first n bytes are passed in, and this code returns "true" if it thinks they represent text. The code has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas, which was based on "perldoc -f -T".
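
A heuristic of this kind can be sketched as follows. The rules used here (a zero byte means binary; otherwise the chunk is binary when more than a third of its bytes are "strange" control or high-bit bytes) follow the general description of Perl's -T test and are an assumption; the connector's actual thresholds may differ. The class name TextSniffSketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class TextSniffSketch {
    // Tab, LF, VT, FF, CR, and space count as whitespace.
    static boolean isWhiteSpace(byte x) {
        return x == 0x09 || x == 0x0a || x == 0x0b || x == 0x0c || x == 0x0d || x == 0x20;
    }

    // Java bytes are signed, so high-bit bytes are negative and also fall under "< 0x20".
    static boolean isStrange(byte x) {
        return (x < 0x20 || x == 0x7f) && !isWhiteSpace(x);
    }

    static boolean isText(byte[] beginChunk, int chunkLength) {
        if (chunkLength == 0) return true; // nothing to contradict "text"
        int strange = 0;
        for (int i = 0; i < chunkLength; i++) {
            byte b = beginChunk[i];
            if (b == 0) return false; // a zero byte is treated as binary
            if (isStrange(b)) strange++;
        }
        // Text unless more than one third of the examined bytes are strange.
        return strange * 3 <= chunkLength;
    }

    public static void main(String[] args) {
        byte[] text = "Hello, world\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isText(text, text.length));
    }
}
```

Only the leading chunk is examined, so the check stays cheap regardless of document size.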


isStrange

protected static boolean isStrange(byte x)
Check if character is not typical ASCII.


isWhiteSpace

protected static boolean isWhiteSpace(byte x)
Check if a byte is a whitespace character.


stringToArray

protected static java.util.ArrayList stringToArray(java.lang.String input)
Read a string as a sequence of individual expressions, urls, etc.
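
One plausible reading of this method is a line-by-line split of a multi-line text field. The sketch below assumes one entry per line, trims each line, and skips blank lines and lines starting with '#'; the comment convention in particular is an assumption, and the class name StringToArraySketch is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class StringToArraySketch {
    static List<String> stringToArray(String input) {
        List<String> list = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = br.readLine()) != null) {
                String s = line.trim();
                // Assumption: blank lines and '#' comment lines are not entries.
                if (s.isEmpty() || s.startsWith("#")) continue;
                list.add(s);
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(stringToArray("# seeds\nhttp://example.com/\n\n  ^https?://.*\\.html$  \n"));
    }
}
```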


compileList

protected static void compileList(java.util.ArrayList output,
                                  java.util.ArrayList input)
                           throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Compile all regexp entries in the passed in list, and add them to the output list.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
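
Compiling a list of expressions might look like the sketch below, which uses java.util.regex.Pattern. To stay self-contained it reports a bad expression with IllegalArgumentException where the real method throws ManifoldCFException; the class name CompileListSketch and the typed lists are likewise illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class CompileListSketch {
    // Compiles every expression in input and appends the result to output.
    static void compileList(List<Pattern> output, List<String> input) {
        for (String expr : input) {
            try {
                output.add(Pattern.compile(expr));
            } catch (PatternSyntaxException e) {
                // Stand-in for ManifoldCFException in this sketch.
                throw new IllegalArgumentException(
                    "Bad regular expression '" + expr + "': " + e.getDescription(), e);
            }
        }
    }

    public static void main(String[] args) {
        List<Pattern> compiled = new ArrayList<>();
        compileList(compiled, Arrays.asList(".*\\.html$", "^http://"));
        System.out.println(compiled.get(0).matcher("http://example.com/a.html").find());
    }
}
```

Compiling once up front means each URL is later matched against ready-made Pattern objects rather than being re-parsed per document.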

getPageCredential

protected PageCredentials getPageCredential(java.lang.String documentIdentifier)
Get the page credentials for a given document identifier (URL)


getSequenceCredential

protected SequenceCredentials getSequenceCredential(java.lang.String documentIdentifier)
Get the sequence credentials for a given document identifier (URL)


getTrustStore

protected org.apache.manifoldcf.core.interfaces.IKeystoreManager getTrustStore(java.lang.String documentIdentifier)
                                                                        throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Get the trust store for a given document identifier (URL)

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

getAcls

protected static java.lang.String[] getAcls(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
Get the forced ACLs from the document specification.

Parameters:
spec - is the document specification.
Returns:
the acls.

findMetadata

protected static java.util.ArrayList findMetadata(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
                                           throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Read a document specification to yield a map of name/value pairs for metadata

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

pack

protected static void pack(java.lang.StringBuffer output,
                           java.lang.String value,
                           char delimiter)
Stuffer for packing a single string with an end delimiter


unpack

protected static int unpack(java.lang.StringBuffer sb,
                            java.lang.String value,
                            int startPosition,
                            char delimiter)
Unstuffer for a single string packed by pack(): reads up to the end delimiter.
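
A delimiter-terminated encoding of this kind is commonly built with backslash escaping, as in the sketch below. The escaping scheme, the return value of unpack (assumed to be the position just past the delimiter), and the use of StringBuilder in place of StringBuffer are all assumptions; the class name PackSketch is hypothetical.

```java
final class PackSketch {
    // Appends value to output, escaping backslashes and the delimiter,
    // then terminates it with the delimiter.
    static void pack(StringBuilder output, String value, char delimiter) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == delimiter) output.append('\\');
            output.append(c);
        }
        output.append(delimiter);
    }

    // Reads one packed string from value starting at startPosition into sb.
    // Returns the position just past the terminating delimiter.
    static int unpack(StringBuilder sb, String value, int startPosition, char delimiter) {
        int i = startPosition;
        while (i < value.length()) {
            char c = value.charAt(i++);
            if (c == '\\') {
                // The escaped character is taken literally.
                if (i < value.length()) sb.append(value.charAt(i++));
            } else if (c == delimiter) {
                break;
            } else {
                sb.append(c);
            }
        }
        return i;
    }

    public static void main(String[] args) {
        StringBuilder packed = new StringBuilder();
        pack(packed, "a+b", '+');
        pack(packed, "c\\d", '+');
        StringBuilder first = new StringBuilder();
        int next = unpack(first, packed.toString(), 0, '+');
        StringBuilder second = new StringBuilder();
        unpack(second, packed.toString(), next, '+');
        System.out.println(first + " | " + second);
    }
}
```

Returning the next position lets several packed fields be read back in sequence from one string, which suits compact version strings.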


packFixedList

protected static void packFixedList(java.lang.StringBuffer output,
                                    java.lang.String[] values,
                                    char delimiter)
Stuffer for packing lists of fixed length


unpackFixedList

protected static int unpackFixedList(java.lang.String[] output,
                                     java.lang.String value,
                                     int startPosition,
                                     char delimiter)
Unstuffer for unpacking lists of fixed length


packList

protected static void packList(java.lang.StringBuffer output,
                               java.util.ArrayList values,
                               char delimiter)
Stuffer for packing lists of variable length


packList

protected static void packList(java.lang.StringBuffer output,
                               java.lang.String[] values,
                               char delimiter)
Another stuffer for packing lists of variable length


unpackList

protected static int unpackList(java.util.ArrayList output,
                                java.lang.String value,
                                int startPosition,
                                char delimiter)
Unstuffer for unpacking lists of variable length.

Parameters:
output - is the array into which the unpacked output is written.
value - is the value to unpack.
startPosition - is the place to start the unpack.
delimiter - is the character to use between values.
Returns:
the next position beyond the end of the list.
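
For variable-length lists, one common approach is to pack the element count first and then each element, so that unpacking knows how many values to read. The sketch below assumes that layout and the same backslash escaping for individual values; both are assumptions rather than the documented wire format, and PackListSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PackListSketch {
    // Single-value helpers: escape backslash and delimiter, terminate with delimiter.
    static void pack(StringBuilder output, String value, char delimiter) {
        for (char c : value.toCharArray()) {
            if (c == '\\' || c == delimiter) output.append('\\');
            output.append(c);
        }
        output.append(delimiter);
    }

    static int unpack(StringBuilder sb, String value, int startPosition, char delimiter) {
        int i = startPosition;
        while (i < value.length()) {
            char c = value.charAt(i++);
            if (c == '\\') {
                if (i < value.length()) sb.append(value.charAt(i++));
            } else if (c == delimiter) {
                break;
            } else {
                sb.append(c);
            }
        }
        return i;
    }

    // Assumed layout: <count><delim><value 1><delim>...<value n><delim>
    static void packList(StringBuilder output, List<String> values, char delimiter) {
        pack(output, Integer.toString(values.size()), delimiter);
        for (String v : values) pack(output, v, delimiter);
    }

    // Appends the unpacked values to output and returns the next position
    // beyond the end of the list.
    static int unpackList(List<String> output, String value, int startPosition, char delimiter) {
        StringBuilder sb = new StringBuilder();
        int pos = unpack(sb, value, startPosition, delimiter);
        int count = Integer.parseInt(sb.toString());
        for (int i = 0; i < count; i++) {
            sb.setLength(0);
            pos = unpack(sb, value, pos, delimiter);
            output.add(sb.toString());
        }
        return pos;
    }

    public static void main(String[] args) {
        StringBuilder packed = new StringBuilder();
        packList(packed, Arrays.asList("x", "y+z"), '+');
        List<String> back = new ArrayList<>();
        int next = unpackList(back, packed + "tail", 0, '+');
        System.out.println(back + " next=" + next);
    }
}
```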

calculateDocumentEvents

protected java.lang.String[] calculateDocumentEvents(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
                                                     java.lang.String documentIdentifier)
Calculate events that should be associated with a document.