org.apache.manifoldcf.crawler.connectors.webcrawler
Class WebcrawlerConnector.DocumentURLFilter

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.DocumentURLFilter
Enclosing class:
WebcrawlerConnector

protected static class WebcrawlerConnector.DocumentURLFilter
extends java.lang.Object

This class holds the URL filtering information digested from a DocumentSpecification.


Field Summary
protected  WebcrawlerConnector.CanonicalizationPolicies canonicalizationPolicies
          Canonicalization policies
protected  java.util.ArrayList excludePatterns
          The ArrayList of exclude patterns
protected  java.util.ArrayList includePatterns
          The ArrayList of include patterns
protected  java.util.HashMap seedHosts
          The hash map of seed hosts; if non-null, URLs are limited to these hosts
 
Constructor Summary
WebcrawlerConnector.DocumentURLFilter(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
          Process a document specification to produce a filter.
 
Method Summary
 WebcrawlerConnector.CanonicalizationPolicies getCanonicalizationPolicies()
          Get canonicalization policies
 boolean isDocumentAndHostLegal(java.lang.String url)
          Check if both a document and host are legal.
 boolean isDocumentLegal(java.lang.String url)
          Check if the document identifier is legal.
 boolean isHostLegal(java.lang.String host)
          Check if a host is legal.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

includePatterns

protected java.util.ArrayList includePatterns
The ArrayList of include patterns


excludePatterns

protected java.util.ArrayList excludePatterns
The ArrayList of exclude patterns


seedHosts

protected java.util.HashMap seedHosts
The hash map of seed hosts; if non-null, URLs are limited to these hosts


canonicalizationPolicies

protected WebcrawlerConnector.CanonicalizationPolicies canonicalizationPolicies
Canonicalization policies

Constructor Detail

WebcrawlerConnector.DocumentURLFilter

public WebcrawlerConnector.DocumentURLFilter(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
                                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a document specification to produce a filter. The regexps in the document specification are expected to be well formed; this should be checked at save time to prevent errors. Any include or exclude regexp found here with a syntax error is therefore skipped.

Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
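
The "skip malformed regexps" behavior described for the constructor can be illustrated with a minimal sketch. This is not the ManifoldCF source: the class TolerantPatternCompiler and its compileAll method are hypothetical names, and the sketch assumes the regexps arrive as plain strings rather than inside a DocumentSpecification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of ManifoldCF: compiles every regexp it can
// and silently skips any regexp that fails to parse.
final class TolerantPatternCompiler {
    private TolerantPatternCompiler() {
    }

    static List<Pattern> compileAll(List<String> regexps) {
        List<Pattern> compiled = new ArrayList<>();
        for (String regexp : regexps) {
            try {
                compiled.add(Pattern.compile(regexp));
            } catch (PatternSyntaxException e) {
                // Malformed regexp: skip it rather than fail the whole filter.
            }
        }
        return compiled;
    }
}
```

Skipping keeps one bad pattern from disabling the whole filter, at the cost of silently dropping a rule, which is why the validity check belongs at save time.
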
Method Detail

isDocumentAndHostLegal

public boolean isDocumentAndHostLegal(java.lang.String url)
Check if both a document and host are legal.


isHostLegal

public boolean isHostLegal(java.lang.String host)
Check if a host is legal.
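
A host check of this kind might look like the sketch below. The class SeedHostCheck is a hypothetical stand-in, not ManifoldCF code. It assumes that a null seed-host map means "no restriction" (consistent with the seedHosts field description) and, as a further assumption, that hosts are stored as lowercase keys and compared case-insensitively.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a seed-host check; not the ManifoldCF implementation.
final class SeedHostCheck {
    // null means "do not limit URLs by host"
    private final Map<String, String> seedHosts;

    SeedHostCheck(Map<String, String> seedHosts) {
        this.seedHosts = seedHosts;
    }

    boolean isHostLegal(String host) {
        if (seedHosts == null) {
            return true;
        }
        // Assumption: keys were stored in lowercase, so normalize before lookup.
        return seedHosts.containsKey(host.toLowerCase(Locale.ROOT));
    }
}
```
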


isDocumentLegal

public boolean isDocumentLegal(java.lang.String url)
Check if the document identifier is legal.
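
One plausible reading of an include/exclude check is sketched below. The exact matching rules are not stated on this page, so the sketch makes two assumptions: a URL is legal when at least one include pattern is found in it and no exclude pattern is, and matching uses Matcher.find() rather than a full-string match. The class IncludeExcludeCheck is hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical include/exclude filter; the real matching rules may differ.
final class IncludeExcludeCheck {
    private final List<Pattern> includePatterns;
    private final List<Pattern> excludePatterns;

    IncludeExcludeCheck(List<Pattern> includePatterns, List<Pattern> excludePatterns) {
        this.includePatterns = includePatterns;
        this.excludePatterns = excludePatterns;
    }

    // Assumption: legal means "some include pattern matches and no exclude pattern matches".
    boolean isDocumentLegal(String url) {
        return matchesAny(includePatterns, url) && !matchesAny(excludePatterns, url);
    }

    private static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern pattern : patterns) {
            // find() matches anywhere in the URL; anchors (^, $) narrow it.
            if (pattern.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }
}
```
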


getCanonicalizationPolicies

public WebcrawlerConnector.CanonicalizationPolicies getCanonicalizationPolicies()
Get canonicalization policies