org.apache.manifoldcf.crawler.connectors.rss
Class RSSConnector

java.lang.Object
  extended by org.apache.manifoldcf.core.connector.BaseConnector
      extended by org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
          extended by org.apache.manifoldcf.crawler.connectors.rss.RSSConnector
All Implemented Interfaces:
org.apache.manifoldcf.core.interfaces.IConnector, org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector

public class RSSConnector
extends org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector

This is the RSS implementation of the IRepositoryConnector interface. This connector reads an RSS document in order to seed the document queue. The feed document is always fetched from the same URL, which is specified in the configuration parameters. The documents subsequently crawled are not scraped for additional links; only the primary document is ingested. Redirections, however, ARE honored, so that sites that rely on them (e.g. the BBC) can be supported.


Nested Class Summary
protected static class RSSConnector.CanonicalizationPolicies
          Class representing a list of canonicalization rules
protected static class RSSConnector.CanonicalizationPolicy
          Class representing a URL regular expression match, for the purposes of determining canonicalization policy
protected static class RSSConnector.EvaluatorToken
          Evaluator token.
protected static class RSSConnector.EvaluatorTokenStream
          Token stream.
protected  class RSSConnector.FeedContextClass
           
protected  class RSSConnector.FeedItemContextClass
           
protected static class RSSConnector.Filter
          Class that handles parsing and interpretation of the document specification.
protected static class RSSConnector.MappingRule
          Class representing a mapping rule
protected static class RSSConnector.MappingRules
          Class that represents all mappings
protected static class RSSConnector.NameValue
          Name/value class
protected  class RSSConnector.OuterContextClass
          This class handles the outermost XML context for the feed document.
protected  class RSSConnector.RDFContextClass
           
protected  class RSSConnector.RDFItemContextClass
           
protected  class RSSConnector.RSSChannelContextClass
           
protected  class RSSConnector.RSSContextClass
           
protected  class RSSConnector.RSSItemContextClass
           
 
Field Summary
static java.lang.String _rcsid
           
static java.lang.String ACTIVITY_FETCH
           
static java.lang.String ACTIVITY_ROBOTSPARSE
           
static java.lang.String bandwidthParameter
          Max kilobytes per second per server
protected static DataCache cache
           
static int CHROMED_SKIP
          Chromed suppression mode - skip all chromed content
static int CHROMED_USE
          Chromed suppression mode - use chromed content
static int DECHROMED_CONTENT
          Dechromed content mode - content field
static int DECHROMED_DESCRIPTION
          Dechromed content mode - description field
static int DECHROMED_NONE
          Dechromed content mode - none
static java.lang.String emailParameter
          Email parameter
protected  ThrottledFetcher fetcher
          The throttled fetcher used by this instance
protected static java.util.Map fetcherMap
          Storage for fetcher objects
protected  java.lang.String from
          The email address for this connector instance
protected  boolean isInitialized
          Flag indicating whether session data is initialized
static java.lang.String maxFetchesParameter
          Max fetches per minute per server
protected  int maxOpenConnectionsPerServer
          The maximum open connections
static java.lang.String maxOpenParameter
          Max simultaneous open connections per server
protected static java.util.HashMap milTzMap
          Timezone mapping from RFC822 timezones to ones understood by Java
protected  double minimumMillisecondsPerBytePerServer
          The minimum milliseconds between bytes
protected  long minimumMillisecondsPerFetchPerServer
          The minimum milliseconds between fetches
protected static java.util.HashMap monthMap
           
protected  java.lang.String proxyAuthDomain
          Proxy auth domain
static java.lang.String proxyAuthDomainParameter
          Proxy auth domain
protected  java.lang.String proxyAuthPassword
          Proxy auth password
static java.lang.String proxyAuthPasswordParameter
          Proxy auth password
protected  java.lang.String proxyAuthUsername
          Proxy auth username
static java.lang.String proxyAuthUsernameParameter
          Proxy auth username
protected  java.lang.String proxyHost
          The proxy host
static java.lang.String proxyHostParameter
          Proxy host name
protected  int proxyPort
          The proxy port
static java.lang.String proxyPortParameter
          Proxy port
protected  Robots robots
          The robots object used by this instance
protected static int ROBOTS_ALL
           
protected static int ROBOTS_DATA
           
protected static int ROBOTS_NONE
           
protected static java.util.Map robotsMap
          Storage for robots objects
protected  int robotsUsage
          Robots usage flag
static java.lang.String robotsUsageParameter
          Robots usage parameter
protected  java.lang.String throttleGroupName
          The throttle group name
static java.lang.String throttleGroupParameter
          The throttle group name
protected static java.util.Map understoodProtocols
           
protected  java.lang.String userAgent
          The user-agent for this connector instance
 
Fields inherited from class org.apache.manifoldcf.core.connector.BaseConnector
currentContext, params
 
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_PARTIAL
 
Constructor Summary
RSSConnector()
          Constructor.
 
Method Summary
 void addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, long startTime, long endTime)
          Queue "seed" documents.
 java.lang.String check()
          Check status of connection.
 void connect(org.apache.manifoldcf.core.interfaces.ConfigParams configParams)
          Connect.
 void disconnect()
          Close the connection.
protected static java.lang.String doCanonicalization(RSSConnector.CanonicalizationPolicy p, java.net.URI url)
          Code to canonicalize a URL.
 java.lang.String[] getActivitiesList()
          Return the list of activities that this connector supports (i.e. writes into the log).
 java.lang.String[] getBinNames(java.lang.String documentIdentifier)
          Get the bin name string for a document identifier.
 int getConnectorModel()
          Tell the world what model this connector uses for getDocumentIdentifiers().
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] oldVersions, org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, int jobType, boolean usesDefaultAuthority)
          Get document versions given an array of document identifiers.
protected  ThrottledFetcher getFetcher()
          Given the current parameters, find the correct throttled fetcher object (or create one if not there).
 java.lang.String getJSPFolder()
          Return the path for the UI interface JSP elements.
 int getMaxDocumentRequest()
          Get the maximum number of documents to amalgamate together into one batch, for this connector.
protected  Robots getRobots(ThrottledFetcher fetcher)
          Given the current parameters, find the correct robots object (or create one if none found).
protected  void getSession()
          Establish a session
protected  void handleRSSFeedSAX(java.lang.String documentIdentifier, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, RSSConnector.Filter filter)
          Handle an RSS feed document, using SAX to limit the memory impact
protected  boolean isContentInteresting(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities, java.lang.String contentType)
          Code to check if data is interesting, based on response code and content type.
protected static java.lang.String makeDocumentIdentifier(RSSConnector.CanonicalizationPolicies policies, java.lang.String parentIdentifier, java.lang.String rawURL)
          Convert an absolute or relative URL to a document identifier.
 void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName)
          Output the configuration body section.
 void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.ArrayList tabsArray)
          Output the configuration header section.
 void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds, java.lang.String tabName)
          Output the specification body section.
 void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds, java.util.ArrayList tabsArray)
          Output the specification header section.
protected static void pack(java.lang.StringBuffer output, java.lang.String value, char delimiter)
          Stuffer for packing a single string with an end delimiter
protected static void packFixedList(java.lang.StringBuffer output, java.lang.String[] values, char delimiter)
          Stuffer for packing lists of fixed length
protected static void packList(java.lang.StringBuffer output, java.util.ArrayList values, char delimiter)
          Stuffer for packing lists of variable length
protected static void packList(java.lang.StringBuffer output, java.lang.String[] values, char delimiter)
          Another stuffer for packing lists of variable length
protected static java.lang.Long parseChinaDate(java.lang.String dateValue)
          Parse a China Daily News date
protected static java.lang.Long parseRSSDate(java.lang.String dateValue)
          Parse an RSS date
protected static java.lang.Long parseZuluDate(java.lang.String dateValue)
          Parse an RDF date
 void poll()
          This method is periodically called for all connectors that are connected but not in active use.
 java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
          Process a configuration post.
 void processDocuments(java.lang.String[] documentIdentifiers, java.lang.String[] versions, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, boolean[] scanOnly, int jobType)
          Process a set of documents.
 java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
          Process a specification post.
 void releaseDocumentVersions(java.lang.String[] documentIdentifiers, java.lang.String[] versions)
          Free a set of documents.
protected static int unpack(java.lang.StringBuffer sb, java.lang.String value, int startPosition, char delimiter)
          Unstuffer for a single string packed by pack().
protected static int unpackFixedList(java.lang.String[] output, java.lang.String value, int startPosition, char delimiter)
          Unstuffer for unpacking lists of fixed length
protected static int unpackList(java.util.ArrayList output, java.lang.String value, int startPosition, char delimiter)
          Unstuffer for unpacking lists of variable length.
 void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
          View configuration.
 void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
          View specification.
 
Methods inherited from class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
addSeedDocuments, getDocumentIdentifiers, getDocumentIdentifiers, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getRelationshipTypes, getRemainingDocumentIdentifiers, processDocuments, requestInfo
 
Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector
clearThreadContext, deinstall, getConfiguration, install, setThreadContext
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.manifoldcf.core.interfaces.IConnector
clearThreadContext, deinstall, getConfiguration, install, setThreadContext
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values

robotsUsageParameter

public static final java.lang.String robotsUsageParameter
Robots usage parameter

See Also:
Constant Field Values

emailParameter

public static final java.lang.String emailParameter
Email parameter

See Also:
Constant Field Values

bandwidthParameter

public static final java.lang.String bandwidthParameter
Max kilobytes per second per server

See Also:
Constant Field Values

maxOpenParameter

public static final java.lang.String maxOpenParameter
Max simultaneous open connections per server

See Also:
Constant Field Values

maxFetchesParameter

public static final java.lang.String maxFetchesParameter
Max fetches per minute per server

See Also:
Constant Field Values

throttleGroupParameter

public static final java.lang.String throttleGroupParameter
The throttle group name

See Also:
Constant Field Values

proxyHostParameter

public static final java.lang.String proxyHostParameter
Proxy host name

See Also:
Constant Field Values

proxyPortParameter

public static final java.lang.String proxyPortParameter
Proxy port

See Also:
Constant Field Values

proxyAuthDomainParameter

public static final java.lang.String proxyAuthDomainParameter
Proxy auth domain

See Also:
Constant Field Values

proxyAuthUsernameParameter

public static final java.lang.String proxyAuthUsernameParameter
Proxy auth username

See Also:
Constant Field Values

proxyAuthPasswordParameter

public static final java.lang.String proxyAuthPasswordParameter
Proxy auth password

See Also:
Constant Field Values

ROBOTS_NONE

protected static final int ROBOTS_NONE
See Also:
Constant Field Values

ROBOTS_DATA

protected static final int ROBOTS_DATA
See Also:
Constant Field Values

ROBOTS_ALL

protected static final int ROBOTS_ALL
See Also:
Constant Field Values

DECHROMED_NONE

public static final int DECHROMED_NONE
Dechromed content mode - none

See Also:
Constant Field Values

DECHROMED_DESCRIPTION

public static final int DECHROMED_DESCRIPTION
Dechromed content mode - description field

See Also:
Constant Field Values

DECHROMED_CONTENT

public static final int DECHROMED_CONTENT
Dechromed content mode - content field

See Also:
Constant Field Values

CHROMED_USE

public static final int CHROMED_USE
Chromed suppression mode - use chromed content

See Also:
Constant Field Values

CHROMED_SKIP

public static final int CHROMED_SKIP
Chromed suppression mode - skip all chromed content

See Also:
Constant Field Values

robotsUsage

protected int robotsUsage
Robots usage flag


userAgent

protected java.lang.String userAgent
The user-agent for this connector instance


from

protected java.lang.String from
The email address for this connector instance


minimumMillisecondsPerFetchPerServer

protected long minimumMillisecondsPerFetchPerServer
The minimum milliseconds between fetches


maxOpenConnectionsPerServer

protected int maxOpenConnectionsPerServer
The maximum open connections


minimumMillisecondsPerBytePerServer

protected double minimumMillisecondsPerBytePerServer
The minimum milliseconds between bytes
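
The two minimum-delay fields appear to be the inverse of the configured limits ("Max kilobytes per second per server" and "Max fetches per minute per server"). The following is a minimal sketch of that arithmetic, not the connector's actual code. The class name ThrottleMathSketch, its method names, and the choice of 1024 bytes per kilobyte are assumptions made for illustration.

```java
final class ThrottleMathSketch {
    // Assumption: 1 kilobyte = 1024 bytes. The connector's own convention is not stated on this page.
    static final double BYTES_PER_KILOBYTE = 1024.0;

    /** Converts a "max kilobytes per second" limit into a minimum delay per byte, in milliseconds. */
    static double minimumMillisecondsPerByte(double maxKilobytesPerSecond) {
        if (maxKilobytesPerSecond <= 0.0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return 1000.0 / (maxKilobytesPerSecond * BYTES_PER_KILOBYTE);
    }

    /** Converts a "max fetches per minute" limit into a minimum delay between fetches, in milliseconds. */
    static long minimumMillisecondsPerFetch(int maxFetchesPerMinute) {
        if (maxFetchesPerMinute <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return 60000L / maxFetchesPerMinute;
    }

    public static void main(String[] args) {
        // 12 fetches per minute leaves at least 5000 ms between fetches.
        System.out.println(minimumMillisecondsPerFetch(12));
        System.out.println(minimumMillisecondsPerByte(64.0));
    }
}
```

Storing the inverse of each limit lets a throttler compare elapsed time against a fixed delay instead of recomputing a rate on every fetch.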


throttleGroupName

protected java.lang.String throttleGroupName
The throttle group name


proxyHost

protected java.lang.String proxyHost
The proxy host


proxyPort

protected int proxyPort
The proxy port


proxyAuthDomain

protected java.lang.String proxyAuthDomain
Proxy auth domain


proxyAuthUsername

protected java.lang.String proxyAuthUsername
Proxy auth username


proxyAuthPassword

protected java.lang.String proxyAuthPassword
Proxy auth password


fetcher

protected ThrottledFetcher fetcher
The throttled fetcher used by this instance


robots

protected Robots robots
The robots object used by this instance


fetcherMap

protected static java.util.Map fetcherMap
Storage for fetcher objects


robotsMap

protected static java.util.Map robotsMap
Storage for robots objects


isInitialized

protected boolean isInitialized
Flag indicating whether session data is initialized


cache

protected static DataCache cache

understoodProtocols

protected static final java.util.Map understoodProtocols

ACTIVITY_FETCH

public static final java.lang.String ACTIVITY_FETCH
See Also:
Constant Field Values

ACTIVITY_ROBOTSPARSE

public static final java.lang.String ACTIVITY_ROBOTSPARSE
See Also:
Constant Field Values

monthMap

protected static java.util.HashMap monthMap

milTzMap

protected static final java.util.HashMap milTzMap
Timezone mapping from RFC822 timezones to ones understood by Java
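
The contents of milTzMap are not shown on this page. As an illustration of the idea only, the sketch below maps a few of the named zones defined by RFC 822 to custom "GMT" offset IDs that java.util.TimeZone understands. The class Rfc822ZoneSketch and its method are hypothetical, and the table is deliberately partial.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

final class Rfc822ZoneSketch {
    // Partial table, for illustration: RFC 822 zone name -> Java custom time zone ID.
    private static final Map<String, String> ZONES = new HashMap<>();
    static {
        ZONES.put("UT", "GMT");
        ZONES.put("GMT", "GMT");
        ZONES.put("Z", "GMT");
        ZONES.put("EST", "GMT-05:00");
        ZONES.put("EDT", "GMT-04:00");
        ZONES.put("CST", "GMT-06:00");
        ZONES.put("CDT", "GMT-05:00");
        ZONES.put("MST", "GMT-07:00");
        ZONES.put("MDT", "GMT-06:00");
        ZONES.put("PST", "GMT-08:00");
        ZONES.put("PDT", "GMT-07:00");
    }

    /** Returns the Java time zone for an RFC 822 zone name, or null if the name is not in the table. */
    static TimeZone toJavaZone(String rfc822Zone) {
        String id = ZONES.get(rfc822Zone.trim().toUpperCase(Locale.ROOT));
        return id == null ? null : TimeZone.getTimeZone(id);
    }

    public static void main(String[] args) {
        System.out.println(toJavaZone("EST").getID());
        System.out.println(toJavaZone("XYZ"));
    }
}
```

Returning null for an unknown name avoids the silent fallback to GMT that TimeZone.getTimeZone applies to IDs it does not recognize.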

Constructor Detail

RSSConnector

public RSSConnector()
Constructor.

Method Detail

getSession

protected void getSession()
                   throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Establish a session

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

getActivitiesList

public java.lang.String[] getActivitiesList()
Return the list of activities that this connector supports (i.e. writes into the log).

Specified by:
getActivitiesList in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getActivitiesList in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Returns:
the list.

getConnectorModel

public int getConnectorModel()
Tell the world what model this connector uses for getDocumentIdentifiers(). This must return one of the MODEL_ constants inherited from IRepositoryConnector.

Specified by:
getConnectorModel in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getConnectorModel in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Returns:
the model type value.

getJSPFolder

public java.lang.String getJSPFolder()
Return the path for the UI JSP elements. These JSPs must be provided to allow the connector to be configured, and to permit it to present document filtering specification information in the UI. This method should return the name of the folder, under the /connectors/ area, where the appropriate JSPs can be found. The name should NOT have a slash in it.

Returns:
the folder part

connect

public void connect(org.apache.manifoldcf.core.interfaces.ConfigParams configParams)
Connect. The configuration parameters are included.

Specified by:
connect in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
connect in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
configParams - are the configuration parameters for this connection. Note well: There are no exceptions allowed from this call, since it is expected to mainly establish connection parameters.

poll

public void poll()
          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
This method is periodically called for all connectors that are connected but not in active use.

Specified by:
poll in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
poll in class org.apache.manifoldcf.core.connector.BaseConnector
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

check

public java.lang.String check()
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check status of connection.

Specified by:
check in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
check in class org.apache.manifoldcf.core.connector.BaseConnector
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

disconnect

public void disconnect()
                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Close the connection. Call this before discarding the repository connector.

Specified by:
disconnect in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
disconnect in class org.apache.manifoldcf.core.connector.BaseConnector
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

getBinNames

public java.lang.String[] getBinNames(java.lang.String documentIdentifier)
Get the bin name string for a document identifier. The bin name describes the queue to which the document will be assigned for throttling purposes. Throttling controls the rate at which items in a given queue are fetched; it does not say anything about the overall fetch rate, which may operate on multiple queues or bins. For example, if you implement a web crawler, a good choice of bin name would be the server name, since that is likely to correspond to a real resource that will need real throttle protection.

Specified by:
getBinNames in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getBinNames in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifier - is the document identifier.
Returns:
the bin name.
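
Following the suggestion above that the server name makes a good bin name for a web-style crawler, here is a minimal sketch that derives a single bin name from a URL-shaped document identifier. It is not the connector's implementation. The class BinNameSketch, the method binNamesFor, and the empty-string fallback for unparseable identifiers are assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class BinNameSketch {
    /** Returns a one-element array holding the lower-cased host name, or "" if none can be found. */
    static String[] binNamesFor(String documentIdentifier) {
        try {
            String host = new URI(documentIdentifier).getHost();
            if (host == null) {
                return new String[] { "" };
            }
            // Lower-case so that "Example.com" and "example.com" share one throttling bin.
            return new String[] { host.toLowerCase(Locale.ROOT) };
        } catch (URISyntaxException e) {
            // Assumption: identifiers that are not legal URIs fall into a catch-all bin.
            return new String[] { "" };
        }
    }

    public static void main(String[] args) {
        System.out.println(binNamesFor("http://News.Example.com/feed.xml")[0]);
    }
}
```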

addSeedDocuments

public void addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities,
                             org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                             long startTime,
                             long endTime)
                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                             org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Queue "seed" documents. Seed documents are the starting places for crawling activity. Documents are seeded when this method calls appropriate methods in the passed-in ISeedingActivity object. This method can choose to find repository changes that happen only during the specified time interval. The seeds recorded by this method will be viewed by the framework based on what the getConnectorModel() method returns. It is not a big problem if the connector chooses to create more seeds than are strictly necessary; it is merely a question of overall work required. The times passed to this method may be interpreted for greatest efficiency. The time ranges any given job uses with this connector will not overlap, but will proceed starting at 0 and going to the "current time", each time the job is run. For continuous crawling jobs, this method will be called once, when the job starts, and at various periodic intervals as the job executes. When a job's specification is changed, the framework automatically resets the seeding start time to 0. The seeding start time may also be set to 0 on each job run, depending on the connector model returned by getConnectorModel(). Note that it is always ok to send MORE documents rather than fewer to this method.

Overrides:
addSeedDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
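
The time range is inclusive at startTime and exclusive at endTime, so consecutive ranges never select the same timestamp twice. The sketch below illustrates that half-open test on a hypothetical map of identifiers to last-modified times. It does not use the ISeedingActivity interface, and this page does not say whether RSSConnector consults the time range at all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SeedWindowSketch {
    /** Returns the identifiers whose timestamp t satisfies startTime <= t < endTime, in map order. */
    static List<String> seedsInWindow(Map<String, Long> lastModified, long startTime, long endTime) {
        List<String> seeds = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastModified.entrySet()) {
            long t = entry.getValue();
            if (t >= startTime && t < endTime) {
                seeds.add(entry.getKey());
            }
        }
        return seeds;
    }

    public static void main(String[] args) {
        Map<String, Long> lastModified = new LinkedHashMap<>();
        lastModified.put("a", 0L);
        lastModified.put("b", 100L);
        lastModified.put("c", 200L);
        // Two back-to-back windows: "c" falls only in the second one.
        System.out.println(seedsInWindow(lastModified, 0L, 200L));
        System.out.println(seedsInWindow(lastModified, 200L, 300L));
    }
}
```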

makeDocumentIdentifier

protected static java.lang.String makeDocumentIdentifier(RSSConnector.CanonicalizationPolicies policies,
                                                         java.lang.String parentIdentifier,
                                                         java.lang.String rawURL)
                                                  throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Convert an absolute or relative URL to a document identifier. This may involve several steps at some point, but right now it does NOT involve converting the host name to a canonical host name. (Doing so would destroy the ability of virtually hosted sites to do the right thing, since the original host name would be lost.) Thus, we do the conversion to IP address right before we actually fetch the document.

Parameters:
policies - are the canonicalization policies in effect.
parentIdentifier - the identifier of the document in which the raw url was found, or null if none.
rawURL - is the raw, un-normalized and un-canonicalized url.
Returns:
the canonical URL (the document identifier), or null if the url was illegal.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
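
To make the description concrete, here is a sketch of resolving a raw URL against its parent with java.net.URI while leaving the host name exactly as written. It ignores canonicalization policies, and dropping the fragment is an assumption of the sketch rather than documented behavior. The class DocumentIdentifierSketch and its method resolve are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class DocumentIdentifierSketch {
    /** Resolves rawURL against parentIdentifier (which may be null); returns null if the result is illegal. */
    static String resolve(String parentIdentifier, String rawURL) {
        try {
            URI raw = new URI(rawURL.trim());
            URI resolved = (parentIdentifier == null) ? raw : new URI(parentIdentifier).resolve(raw);
            resolved = resolved.normalize();
            if (!resolved.isAbsolute() || resolved.getHost() == null) {
                return null;
            }
            // Rebuild without the fragment; the authority keeps the host name as originally written.
            StringBuilder sb = new StringBuilder();
            sb.append(resolved.getScheme()).append("://").append(resolved.getRawAuthority());
            if (resolved.getRawPath() != null) {
                sb.append(resolved.getRawPath());
            }
            if (resolved.getRawQuery() != null) {
                sb.append('?').append(resolved.getRawQuery());
            }
            return sb.toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/feeds/news.xml", "../stories/1.html#top"));
    }
}
```

Because the host is never rewritten, a virtually hosted site still receives the host name that appeared in the feed.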

doCanonicalization

protected static java.lang.String doCanonicalization(RSSConnector.CanonicalizationPolicy p,
                                                     java.net.URI url)
                                              throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                                     java.net.URISyntaxException
Code to canonicalize a URL. If the URL cannot be canonicalized (and is illegal), null is returned.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.net.URISyntaxException
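
This page does not list the rules a CanonicalizationPolicy can carry. As a sketch of what URL canonicalization commonly involves, the example below sorts query arguments and drops two well-known session-ID arguments. The class CanonicalizationSketch, the two boolean flags, and the argument names treated as session IDs are all assumptions for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class CanonicalizationSketch {
    /** Returns a canonical form of url, or null if url has no scheme or host. */
    static String canonicalize(URI url, boolean reorderArguments, boolean removeSessionIds) {
        if (url.getScheme() == null || url.getHost() == null) {
            return null;
        }
        List<String> args = new ArrayList<>();
        String query = url.getRawQuery();
        if (query != null) {
            for (String arg : query.split("&")) {
                if (arg.isEmpty()) {
                    continue;
                }
                String name = arg.split("=", 2)[0].toLowerCase(Locale.ROOT);
                // Hypothetical session-ID argument names; a real policy would configure these.
                if (removeSessionIds && (name.equals("phpsessid") || name.equals("jsessionid"))) {
                    continue;
                }
                args.add(arg);
            }
        }
        if (reorderArguments) {
            Collections.sort(args);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(url.getScheme()).append("://").append(url.getRawAuthority());
        if (url.getRawPath() != null) {
            sb.append(url.getRawPath());
        }
        if (!args.isEmpty()) {
            sb.append('?').append(String.join("&", args));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize(URI.create("http://example.com/a?z=1&PHPSESSID=abc&b=2"), true, true));
    }
}
```

With both flags set, URLs that differ only in argument order or session ID collapse to one document identifier, so the same story is not queued twice.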

getDocumentVersions

public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                              java.lang.String[] oldVersions,
                                              org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
                                              org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                                              int jobType,
                                              boolean usesDefaultAuthority)
                                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                              org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here.

Specified by:
getDocumentVersions in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getDocumentVersions in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
oldVersions - is the corresponding array of version strings that have been saved for the document identifiers. A null value indicates that this is a first-time fetch, while an empty string indicates that the previous document had an empty version string.
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
jobType - is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
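
The version-string contract described above can be summarized as a small decision table. The sketch below is an illustration of that contract, not framework code. The class VersionDecisionSketch and its Action enum are hypothetical names.

```java
final class VersionDecisionSketch {
    enum Action { DELETE, PROCESS, SKIP }

    /**
     * Decides what happens to a document, given the saved version string
     * and the one just returned for it.
     */
    static Action decide(String oldVersion, String newVersion) {
        if (newVersion == null) {
            return Action.DELETE;      // the document no longer exists
        }
        if (newVersion.isEmpty()) {
            return Action.PROCESS;     // no versioning ability: always process
        }
        if (oldVersion == null) {
            return Action.PROCESS;     // first-time fetch
        }
        return newVersion.equals(oldVersion) ? Action.SKIP : Action.PROCESS;
    }

    public static void main(String[] args) {
        System.out.println(decide("v1", "v1"));
        System.out.println(decide("v1", "v2"));
        System.out.println(decide("v1", null));
    }
}
```

Because an unchanged version string lets a document be skipped, any specification data that affects ingestion belongs in the version string, as the spec parameter description notes.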

processDocuments

public void processDocuments(java.lang.String[] documentIdentifiers,
                             java.lang.String[] versions,
                             org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                             org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                             boolean[] scanOnly,
                             int jobType)
                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                             org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Process a set of documents. This method should cause each document to be fetched and processed, with the results added to the queue of documents for the current job and/or entered into the incremental ingestion manager. The document specification allows this class to filter what is done based on the job.

Specified by:
processDocuments in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
processDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifiers - is the set of document identifiers to process.
activities - is the interface this method should use to queue up new document references and ingest documents.
spec - is the document specification.
scanOnly - is an array corresponding to the document identifiers. An entry is set to true when the processing of that document should only find other references, and should not actually call the ingestion methods.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption

releaseDocumentVersions

public void releaseDocumentVersions(java.lang.String[] documentIdentifiers,
                                    java.lang.String[] versions)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Free a set of documents. This method is called for all documents whose versions have been fetched using the getDocumentVersions() method, including those that returned null versions. It may be used to free resources committed during the getDocumentVersions() method. It is guaranteed to be called AFTER any calls to processDocuments() for the documents in question.

Specified by:
releaseDocumentVersions in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
releaseDocumentVersions in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifiers - is the set of document identifiers.
versions - is the corresponding set of version identifiers (individual identifiers may be null).
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

outputConfigurationHeader

public void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                      org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                      org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
                                      java.util.ArrayList tabsArray)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      java.io.IOException
Output the configuration header section. This method is called in the head section of the connector's configuration page. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the configuration editing HTML.

Specified by:
outputConfigurationHeader in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
outputConfigurationHeader in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
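
The two duties of this method (adding tab names and emitting javascript) can be sketched like this. The tab names, the checkConfig function, and the use of a StringBuilder in place of IHTTPOutput are all assumptions for illustration only.

```java
import java.util.List;

// Hypothetical sketch of a configuration header method.
class ConfigHeaderSketch {
    static void outputHeader(StringBuilder out, List<String> tabsArray) {
        // Append connector-specific tabs after any tabs already in the list.
        tabsArray.add("Email");
        tabsArray.add("Bandwidth");
        // Emit a script block that the editing form could call before posting.
        out.append("<script type=\"text/javascript\">\n")
           .append("function checkConfig() { return true; }\n")
           .append("</script>\n");
    }
}
```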

outputConfigurationBody

public void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                    org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                    org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
                                    java.lang.String tabName)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                    java.io.IOException
Output the configuration body section. This method is called in the body section of the connector's configuration page. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editconnection".

Specified by:
outputConfigurationBody in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
outputConfigurationBody in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabName - is the current tab name.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

processConfigurationPost

public java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                                 org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
                                                 org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a configuration post. This method is called at the start of the connector's configuration page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the configuration parameters accordingly. The name of the posted form is "editconnection".

Specified by:
processConfigurationPost in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
processConfigurationPost in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
variableContext - is the set of variables available from the post, including binary file post information.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
Returns:
null if all is well, or a string error message if there is an error that should prevent saving of the connection (and cause a redirection to an error page).
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
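
The return convention (null on success, an error string otherwise) can be sketched as below. Plain maps stand in for IPostParameters and ConfigParams, and the "email" field with its validation rule is a hypothetical example, not taken from the connector.

```java
import java.util.Map;

// Hypothetical sketch of processing a configuration post.
class ConfigPostSketch {
    static String processPost(Map<String, String> posted, Map<String, String> parameters) {
        String email = posted.get("email");
        // A missing field means nothing was posted for it; leave the parameters untouched.
        if (email != null) {
            if (!email.contains("@")) {
                // A non-null return value is the error message that blocks saving.
                return "Invalid email address: " + email;
            }
            parameters.put("email", email);
        }
        return null; // null signals that all is well
    }
}
```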

viewConfiguration

public void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                              org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                              org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                              java.io.IOException
View configuration. This method is called in the body section of the connector's view configuration page. Its purpose is to present the connection information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.

Specified by:
viewConfiguration in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
viewConfiguration in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

outputSpecificationHeader

public void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                      org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
                                      java.util.ArrayList tabsArray)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      java.io.IOException
Output the specification header section. This method is called in the head section of a job page which has selected a repository connection of the current type. Its purpose is to add the required tabs to the list, and to output any javascript methods that might be needed by the job editing HTML.

Specified by:
outputSpecificationHeader in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
outputSpecificationHeader in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

outputSpecificationBody

public void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                    org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
                                    java.lang.String tabName)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                    java.io.IOException
Output the specification body section. This method is called in the body section of a job page which has selected a repository connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editjob".

Specified by:
outputSpecificationBody in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
outputSpecificationBody in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabName - is the current tab name.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

processSpecificationPost

public java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
                                                 org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the document specification accordingly. The name of the posted form is "editjob".

Specified by:
processSpecificationPost in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
processSpecificationPost in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
variableContext - contains the post data, including binary file-upload information.
ds - is the current document specification for this job.
Returns:
null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

viewSpecification

public void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                              org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                              java.io.IOException
View specification. This method is called in the body section of a job's view page. Its purpose is to present the document specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.

Specified by:
viewSpecification in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
viewSpecification in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

handleRSSFeedSAX

protected void handleRSSFeedSAX(java.lang.String documentIdentifier,
                                org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                RSSConnector.Filter filter)
                         throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Handle an RSS feed document, using SAX to limit the memory impact

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption

parseZuluDate

protected static java.lang.Long parseZuluDate(java.lang.String dateValue)
Parse an RDF date
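
A minimal sketch of what such a parser could look like, assuming that "Zulu" refers to an ISO 8601 UTC timestamp ending in "Z", that the returned Long is milliseconds since the epoch, and that null signals an unparseable value. None of these details are stated here, and the class ZuluDateSketch is hypothetical.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

// Hypothetical sketch; the connector's accepted formats may differ.
class ZuluDateSketch {
    static Long parseZuluDate(String dateValue) {
        if (dateValue == null) {
            return null;
        }
        try {
            // Instant.parse accepts ISO 8601 instants such as 2011-03-01T12:00:00Z.
            return Instant.parse(dateValue.trim()).toEpochMilli();
        } catch (DateTimeParseException e) {
            return null; // assumed convention for unparseable input
        }
    }
}
```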


parseChinaDate

protected static java.lang.Long parseChinaDate(java.lang.String dateValue)
Parse a China Daily News date


parseRSSDate

protected static java.lang.Long parseRSSDate(java.lang.String dateValue)
Parse an RSS date
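
RSS feeds conventionally carry RFC 822 style dates. The sketch below parses that form with the JDK's RFC 1123 formatter, under the same assumptions as before: the result is milliseconds since the epoch and null means the value could not be parsed. The class RssDateSketch is hypothetical, and the connector itself may well accept more variants than this.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical sketch; handles only well-formed RFC 1123 dates.
class RssDateSketch {
    static Long parseRSSDate(String dateValue) {
        if (dateValue == null) {
            return null;
        }
        try {
            // Example of an accepted value: "Thu, 01 Jan 1970 00:00:01 GMT".
            return ZonedDateTime.parse(dateValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant()
                    .toEpochMilli();
        } catch (DateTimeParseException e) {
            return null; // assumed convention for unparseable input
        }
    }
}
```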


getMaxDocumentRequest

public int getMaxDocumentRequest()
Get the maximum number of documents to amalgamate together into one batch, for this connector.

Specified by:
getMaxDocumentRequest in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getMaxDocumentRequest in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Returns:
the maximum number. 0 indicates "unlimited".
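
To illustrate how a caller might interpret this value, the sketch below splits a list of identifiers into batches and treats 0 as "no limit". The class BatchSketch and its batch method are hypothetical; the framework's actual batching logic is not shown in this document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of honoring a maximum batch size where 0 means unlimited.
class BatchSketch {
    static List<List<String>> batch(List<String> ids, int maxDocumentRequest) {
        List<List<String>> batches = new ArrayList<>();
        if (ids.isEmpty()) {
            return batches;
        }
        // 0 (or any non-positive value) is treated as "put everything in one batch".
        int size = (maxDocumentRequest <= 0) ? ids.size() : maxDocumentRequest;
        for (int i = 0; i < ids.size(); i += size) {
            batches.add(new ArrayList<>(ids.subList(i, Math.min(i + size, ids.size()))));
        }
        return batches;
    }
}
```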

isContentInteresting

protected boolean isContentInteresting(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities,
                                       java.lang.String contentType)
                                throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
                                       org.apache.manifoldcf.core.interfaces.ManifoldCFException
Code to check if data is interesting, based on response code and content type.

Throws:
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException

pack

protected static void pack(java.lang.StringBuffer output,
                           java.lang.String value,
                           char delimiter)
Stuffer for packing a single string with an end delimiter


unpack

protected static int unpack(java.lang.StringBuffer sb,
                            java.lang.String value,
                            int startPosition,
                            char delimiter)
Unstuffer for unpacking a single string packed with an end delimiter (the inverse of pack).
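
The escaping scheme is not described here. One common way to implement a pack/unpack pair with these signatures is backslash stuffing, sketched below as an assumption rather than as the connector's actual encoding. The class PackSketch is hypothetical.

```java
// Hypothetical sketch of delimiter stuffing with a backslash escape.
class PackSketch {
    /** Appends value to output, escaping backslashes and delimiters, then the end delimiter. */
    static void pack(StringBuffer output, String value, char delimiter) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == delimiter) {
                output.append('\\');
            }
            output.append(c);
        }
        output.append(delimiter);
    }

    /** Reads one packed string from value into sb; returns the position just past its delimiter. */
    static int unpack(StringBuffer sb, String value, int startPosition, char delimiter) {
        int i = startPosition;
        while (i < value.length()) {
            char c = value.charAt(i++);
            if (c == '\\') {
                // The character after a backslash is taken literally.
                if (i < value.length()) {
                    sb.append(value.charAt(i++));
                }
            } else if (c == delimiter) {
                break;
            } else {
                sb.append(c);
            }
        }
        return i;
    }
}
```

Because the delimiter and the escape character are both escaped, a value that itself contains the delimiter survives a round trip.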


packFixedList

protected static void packFixedList(java.lang.StringBuffer output,
                                    java.lang.String[] values,
                                    char delimiter)
Stuffer for packing lists of fixed length


unpackFixedList

protected static int unpackFixedList(java.lang.String[] output,
                                     java.lang.String value,
                                     int startPosition,
                                     char delimiter)
Unstuffer for unpacking lists of fixed length


packList

protected static void packList(java.lang.StringBuffer output,
                               java.util.ArrayList values,
                               char delimiter)
Stuffer for packing lists of variable length


packList

protected static void packList(java.lang.StringBuffer output,
                               java.lang.String[] values,
                               char delimiter)
Another stuffer for packing lists of variable length


unpackList

protected static int unpackList(java.util.ArrayList output,
                                java.lang.String value,
                                int startPosition,
                                char delimiter)
Unstuffer for unpacking lists of variable length.

Parameters:
output - is the list to fill with the unpacked data.
value - is the value to unpack.
startPosition - is the place to start the unpack.
delimiter - is the character to use between values.
Returns:
the next position beyond the end of the list.
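
A variable-length list needs some way to tell the unpacker how many items follow. The sketch below assumes a count prefix followed by the packed items, and it assumes backslash stuffing for the individual strings. Both are illustrative assumptions, not documented behavior, and the class PackListSketch is hypothetical.

```java
import java.util.List;

// Hypothetical sketch: count-prefixed list packing built on single-string stuffing.
class PackListSketch {
    static void pack(StringBuffer output, String value, char delimiter) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == delimiter) output.append('\\');
            output.append(c);
        }
        output.append(delimiter);
    }

    static int unpack(StringBuffer sb, String value, int startPosition, char delimiter) {
        int i = startPosition;
        while (i < value.length()) {
            char c = value.charAt(i++);
            if (c == '\\') {
                if (i < value.length()) sb.append(value.charAt(i++));
            } else if (c == delimiter) {
                break;
            } else {
                sb.append(c);
            }
        }
        return i;
    }

    /** Packs the item count first, then each item. */
    static void packList(StringBuffer output, List<String> values, char delimiter) {
        pack(output, Integer.toString(values.size()), delimiter);
        for (String v : values) {
            pack(output, v, delimiter);
        }
    }

    /** Reads the count, then that many items; returns the next position beyond the list. */
    static int unpackList(List<String> output, String value, int startPosition, char delimiter) {
        StringBuffer sb = new StringBuffer();
        int pos = unpack(sb, value, startPosition, delimiter);
        int count = Integer.parseInt(sb.toString());
        for (int i = 0; i < count; i++) {
            sb.setLength(0);
            pos = unpack(sb, value, pos, delimiter);
            output.add(sb.toString());
        }
        return pos;
    }
}
```

Returning the next position lets a caller unpack several lists or strings from one packed value in sequence.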

getFetcher

protected ThrottledFetcher getFetcher()
Given the current parameters, find the correct throttled fetcher object (or create one if not there).
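
The "find, or create if not there" behavior is a standard lookup-or-create cache. The sketch below keys the cache by a single string; the nested Fetcher class, the throttleGroup key, and the static map are assumptions for illustration, since how ThrottledFetcher instances are actually keyed and stored is not described here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a find-or-create lookup.
class FetcherCacheSketch {
    /** Stand-in for ThrottledFetcher. */
    static class Fetcher {
        final String key;
        Fetcher(String key) {
            this.key = key;
        }
    }

    private static final Map<String, Fetcher> FETCHERS = new HashMap<>();

    /** Returns the existing fetcher for the key, creating and remembering one if absent. */
    static synchronized Fetcher getFetcher(String throttleGroup) {
        Fetcher fetcher = FETCHERS.get(throttleGroup);
        if (fetcher == null) {
            fetcher = new Fetcher(throttleGroup);
            FETCHERS.put(throttleGroup, fetcher);
        }
        return fetcher;
    }
}
```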


getRobots

protected Robots getRobots(ThrottledFetcher fetcher)
Given the current parameters, find the correct robots object (or create one if none found).