org.apache.manifoldcf.crawler.connectors.rss
Class RSSConnector.Filter

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.Filter
Enclosing class:
RSSConnector

protected static class RSSConnector.Filter
extends java.lang.Object

Class that handles parsing and interpretation of the document specification. The specification is interpreted once, up front, on the belief that gathering all the data in a single pass is faster than scanning the document specification multiple times. This class therefore contains the *entire* interpreted set of data from a document specification.
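The single-pass design described above can be illustrated with a minimal, self-contained sketch. It does not use the ManifoldCF API: `SpecNodeSketch`, `ParseOnceFilterSketch`, and the node type names `"seed"`, `"metadata"`, and `"defaultrescan"` are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one node of a document specification:
// a node type plus a single value. This is NOT the ManifoldCF API.
final class SpecNodeSketch {
    final String type;
    final String value;

    SpecNodeSketch(String type, String value) {
        this.type = type;
        this.value = value;
    }
}

// Illustrative only: interprets every node in one pass over the
// specification and keeps the results in fields for later lookups.
final class ParseOnceFilterSketch {
    private final Map<String, String> seeds = new HashMap<>();
    private final List<String> metadata = new ArrayList<>();
    private Integer defaultRescanInterval = null; // stays null when the spec omits it

    ParseOnceFilterSketch(List<SpecNodeSketch> spec) {
        for (SpecNodeSketch node : spec) {
            switch (node.type) {
                case "seed": // hypothetical node type name
                    seeds.put(node.value, node.value);
                    break;
                case "metadata": // hypothetical node type name
                    metadata.add(node.value);
                    break;
                case "defaultrescan": // hypothetical node type name
                    defaultRescanInterval = Integer.valueOf(node.value);
                    break;
                default:
                    // Unknown node types are ignored in this sketch.
                    break;
            }
        }
    }

    boolean isSeed(String url) {
        return seeds.containsKey(url);
    }

    List<String> getMetadata() {
        return metadata;
    }

    Integer getDefaultRescanInterval() {
        return defaultRescanInterval;
    }
}
```

After construction, every query is a field lookup, so the specification is never scanned again.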


Field Summary
protected  java.util.HashMap acls
           
protected  java.lang.Integer badFeedRescanInterval
           
protected  RSSConnector.CanonicalizationPolicies canonicalizationPolicies
           
protected  int chromedContentMode
           
protected  int dechromedContentMode
           
protected  java.lang.Integer defaultRescanInterval
           
protected  int feedTimeoutValue
           
protected  RSSConnector.MappingRules mappings
           
protected  java.util.ArrayList metadata
           
protected  java.lang.Integer minimumRescanInterval
           
protected  java.util.HashMap seeds
           
 
Constructor Summary
RSSConnector.Filter(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, boolean warnOnBadSeed)
          Constructor. Parses and interprets the given document specification.
 
Method Summary
 java.lang.String[] getAcls()
          Get the ACLs
 java.lang.Long getBadFeedRescanTime(long currentTime)
          Get the next time a "bad feed" should be rescanned
 RSSConnector.CanonicalizationPolicies getCanonicalizationPolicies()
          Get canonicalization policies
 int getChromedContentMode()
          Get the chromed content mode
 int getDechromedContentMode()
          Get the dechromed content mode
 java.lang.Long getDefaultRescanTime(long currentTime)
          Get the next time (by default) a feed should be scanned
 int getFeedTimeoutValue()
          Get the feed timeout value
 java.util.ArrayList getMetadata()
          Get the specified metadata
 java.lang.Long getMinimumRescanTime(long currentTime)
          Get the earliest next time a feed should be scanned
 java.util.Iterator getSeeds()
          Iterate over all canonicalized seeds
 boolean isLegalURL(java.lang.String url)
          Check for legality of a url.
 boolean isSeed(java.lang.String canonicalUrl)
          Check if document is a seed
 java.lang.String mapDocumentURL(java.lang.String url)
          Scan the mapping patterns and return the result of the first one that matches.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

mappings

protected RSSConnector.MappingRules mappings

seeds

protected java.util.HashMap seeds

defaultRescanInterval

protected java.lang.Integer defaultRescanInterval

minimumRescanInterval

protected java.lang.Integer minimumRescanInterval

badFeedRescanInterval

protected java.lang.Integer badFeedRescanInterval

dechromedContentMode

protected int dechromedContentMode

chromedContentMode

protected int chromedContentMode

feedTimeoutValue

protected int feedTimeoutValue

metadata

protected java.util.ArrayList metadata

acls

protected java.util.HashMap acls

canonicalizationPolicies

protected RSSConnector.CanonicalizationPolicies canonicalizationPolicies

Constructor Detail

RSSConnector.Filter

public RSSConnector.Filter(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                           boolean warnOnBadSeed)
                    throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Constructor. Parses and interprets the given document specification.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

Method Detail

isSeed

public boolean isSeed(java.lang.String canonicalUrl)
Check if document is a seed


getSeeds

public java.util.Iterator getSeeds()
Iterate over all canonicalized seeds


getMetadata

public java.util.ArrayList getMetadata()
Get the specified metadata


getAcls

public java.lang.String[] getAcls()
Get the ACLs


getFeedTimeoutValue

public int getFeedTimeoutValue()
Get the feed timeout value


getDechromedContentMode

public int getDechromedContentMode()
Get the dechromed content mode


getChromedContentMode

public int getChromedContentMode()
Get the chromed content mode


getDefaultRescanTime

public java.lang.Long getDefaultRescanTime(long currentTime)
Get the next time (by default) a feed should be scanned


getMinimumRescanTime

public java.lang.Long getMinimumRescanTime(long currentTime)
Get the earliest next time a feed should be scanned


getBadFeedRescanTime

public java.lang.Long getBadFeedRescanTime(long currentTime)
Get the next time a "bad feed" should be rescanned
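
The three rescan-time getters above each take the current time and return a `java.lang.Long`, while the corresponding interval fields are `java.lang.Integer`. The following is a minimal sketch of how such a calculation might look, not the actual implementation. It assumes that the interval is expressed in minutes, that `currentTime` is in milliseconds, and that a null interval means "no time configured"; none of these is stated on this page. `RescanTimeSketch` and `nextRescanTime` are hypothetical names.

```java
// Illustrative helper, not the ManifoldCF implementation.
final class RescanTimeSketch {
    // ASSUMPTION: intervals are stored in minutes and times are in milliseconds.
    private static final long MILLIS_PER_MINUTE = 60_000L;

    private RescanTimeSketch() {
    }

    // Returns null when no interval is configured (assumed meaning of a null
    // Integer field); otherwise returns currentTimeMillis plus the interval.
    static Long nextRescanTime(Integer intervalMinutes, long currentTimeMillis) {
        if (intervalMinutes == null) {
            return null;
        }
        return Long.valueOf(currentTimeMillis + intervalMinutes.longValue() * MILLIS_PER_MINUTE);
    }
}
```

Under these assumptions, a caller has to handle a null result before comparing the returned time against a schedule.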


isLegalURL

public boolean isLegalURL(java.lang.String url)
Check for legality of a url.

Returns:
true if the passed-in URL is either a seed or a legal URL according to this filter.
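
The "seed or legal" rule in the return description can be sketched as a short-circuit check. This is only an illustration: `LegalUrlSketch` is a hypothetical class, and the `legalityRule` predicate stands in for whatever test the real filter applies, which this page does not describe.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative only; not the ManifoldCF implementation.
final class LegalUrlSketch {
    private final Set<String> seeds;
    // Hypothetical stand-in for the filter's own legality test.
    private final Predicate<String> legalityRule;

    LegalUrlSketch(Set<String> seeds, Predicate<String> legalityRule) {
        this.seeds = new HashSet<>(seeds);
        this.legalityRule = legalityRule;
    }

    // A URL is accepted if it is a known seed, or failing that,
    // if it passes the legality rule.
    boolean isLegalURL(String url) {
        return seeds.contains(url) || legalityRule.test(url);
    }
}
```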

mapDocumentURL

public java.lang.String mapDocumentURL(java.lang.String url)
                                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Scan the mapping patterns and return the result of the first one that matches.

Returns:
null if the URL doesn't match or should not be ingested; otherwise the new (mapped) string.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
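
A first-match scan of the kind described for `mapDocumentURL` can be sketched with `java.util.regex`. This is an assumption-laden illustration, not the connector's code: `FirstMatchMapperSketch`, its `Rule` type, and the idea that each rule is a regular expression paired with a replacement template are all hypothetical, and the "should not be ingested" case is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; not the ManifoldCF implementation.
final class FirstMatchMapperSketch {
    // Hypothetical rule: a regex plus a replacement template ($1, $2, ... allowed).
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    FirstMatchMapperSketch add(String regex, String replacement) {
        rules.add(new Rule(regex, replacement));
        return this;
    }

    // Scans rules in insertion order. The first rule whose pattern matches
    // the whole URL produces the result; null means no rule matched.
    String map(String url) {
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(url);
            if (m.matches()) {
                // The match spans the entire input, so expanding the
                // template once yields the complete mapped URL.
                StringBuffer sb = new StringBuffer();
                m.appendReplacement(sb, rule.replacement);
                return sb.toString();
            }
        }
        return null;
    }
}
```

With a rule added as `add("http://old\\.example\\.com/(.*)", "http://new.example.com/$1")`, calling `map("http://old.example.com/a.xml")` returns `"http://new.example.com/a.xml"`, and a URL that matches no rule returns null.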

getCanonicalizationPolicies

public RSSConnector.CanonicalizationPolicies getCanonicalizationPolicies()
Get canonicalization policies