org.apache.manifoldcf.crawler.connectors.webcrawler
Class WebcrawlerConnector.FindHTMLHrefHandler

java.lang.Object
  extended by org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.FindHandler
      extended by org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.FindHTMLHrefHandler
All Implemented Interfaces:
IDiscoveredLinkHandler, IHTMLHandler, IMetaTagHandler
Enclosing class:
WebcrawlerConnector

protected class WebcrawlerConnector.FindHTMLHrefHandler
extends WebcrawlerConnector.FindHandler
implements IHTMLHandler

This class is the handler for HTML parsing during state transitions.


Field Summary
protected  java.util.regex.Pattern preferredLinkPattern
           
 
Fields inherited from class org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.FindHandler
parentURI, targetURI
 
Constructor Summary
WebcrawlerConnector.FindHTMLHrefHandler(java.lang.String parentURI, java.util.regex.Pattern preferredLinkPattern)
           
 
Method Summary
 void noteAHREF(java.lang.String rawURL)
          Note discovered href
 void noteDiscoveredLink(java.lang.String rawURL)
          Override noteDiscoveredLink
 void noteFormEnd()
          Note the end of a form
 void noteFormInput(java.util.Map inputAttributes)
          Note an input tag
 void noteFormStart(java.util.Map formAttributes)
          Note the start of a form
 void noteFRAMESRC(java.lang.String rawURL)
          Note discovered FRAME SRC
 void noteIMGSRC(java.lang.String rawURL)
          Note discovered IMG SRC
 void noteLINKHREF(java.lang.String rawURL)
          Note discovered LINK HREF
 void noteMetaTag(java.util.Map metaAttributes)
          Note a meta tag
 
Methods inherited from class org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.FindHandler
getTargetURI
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

preferredLinkPattern

protected java.util.regex.Pattern preferredLinkPattern
Constructor Detail

WebcrawlerConnector.FindHTMLHrefHandler

public WebcrawlerConnector.FindHTMLHrefHandler(java.lang.String parentURI,
                                               java.util.regex.Pattern preferredLinkPattern)
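
The constructor takes a compiled java.util.regex.Pattern. This page does not say how the handler applies it, so the following is only a minimal sketch of preparing such a pattern and testing candidate URIs against it. It assumes that a match anywhere in the URI counts and that a null pattern accepts every link. The class name PreferredLinkPatternSketch, the helper isPreferred, and the example URLs are hypothetical and not part of ManifoldCF.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the connector's actual code.
final class PreferredLinkPatternSketch {

    // Returns true when the URI is acceptable under the given pattern.
    // Assumption: a null pattern means "no preference", so every URI passes.
    static boolean isPreferred(Pattern preferredLinkPattern, String uri) {
        if (preferredLinkPattern == null) {
            return true;
        }
        // find() matches anywhere in the string, unlike matches(),
        // which requires the whole string to match.
        return preferredLinkPattern.matcher(uri).find();
    }

    public static void main(String[] args) {
        Pattern loginLinks = Pattern.compile("/login", Pattern.CASE_INSENSITIVE);
        System.out.println(isPreferred(loginLinks, "http://example.com/Login?next=/"));
        System.out.println(isPreferred(loginLinks, "http://example.com/about"));
        System.out.println(isPreferred(null, "http://example.com/about"));
    }
}
```

Using find() rather than matches() keeps the pattern short, since it only has to describe the distinguishing fragment of the link.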
Method Detail

noteMetaTag

public void noteMetaTag(java.util.Map metaAttributes)
                 throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note a meta tag

Specified by:
noteMetaTag in interface IMetaTagHandler
Parameters:
metaAttributes - are the attributes that belong to the tag.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

noteFormStart

public void noteFormStart(java.util.Map formAttributes)
                   throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note the start of a form

Specified by:
noteFormStart in interface IHTMLHandler
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

noteFormInput

public void noteFormInput(java.util.Map inputAttributes)
                   throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note an input tag

Specified by:
noteFormInput in interface IHTMLHandler
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

noteFormEnd

public void noteFormEnd()
                 throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note the end of a form

Specified by:
noteFormEnd in interface IHTMLHandler
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

noteDiscoveredLink

public void noteDiscoveredLink(java.lang.String rawURL)
                        throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Override noteDiscoveredLink

Specified by:
noteDiscoveredLink in interface IDiscoveredLinkHandler
Overrides:
noteDiscoveredLink in class WebcrawlerConnector.FindHandler
Parameters:
rawURL - is the raw discovered URL. This may be relative, malformed, or otherwise unsuitable for use until its final form is achieved.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
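
Because rawURL may be relative or malformed, a caller typically has to resolve it against the parent URI before using it. The sketch below shows one way to do that with java.net.URI. It is an assumption about what reaching the "final form" could look like, not the connector's implementation. The class name RawUrlResolver and the choice to return null for unusable input are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration; not the connector's actual code.
final class RawUrlResolver {

    // Resolves a raw (possibly relative) URL against the parent document URI.
    // Returns null when either input cannot be parsed as a URI.
    static String resolve(String parentURI, String rawURL) {
        try {
            URI parent = new URI(parentURI);
            URI candidate = new URI(rawURL.trim());
            // resolve() leaves absolute URIs unchanged and merges relative ones;
            // normalize() removes any remaining "." and ".." segments.
            return parent.resolve(candidate).normalize().toString();
        } catch (URISyntaxException e) {
            // Malformed input: report it as unusable rather than failing.
            return null;
        }
    }

    public static void main(String[] args) {
        String parent = "http://example.com/a/b.html";
        System.out.println(resolve(parent, "c.html"));
        System.out.println(resolve(parent, "../x"));
        System.out.println(resolve(parent, "https://other.example/y"));
        System.out.println(resolve(parent, "ht tp://bad url"));
    }
}
```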

noteAHREF

public void noteAHREF(java.lang.String rawURL)
               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note discovered href

Specified by:
noteAHREF in interface IHTMLHandler
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

noteLINKHREF

public void noteLINKHREF(java.lang.String rawURL)
                  throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note discovered LINK HREF

Specified by:
noteLINKHREF in interface IHTMLHandler
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

noteIMGSRC

public void noteIMGSRC(java.lang.String rawURL)
                throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note discovered IMG SRC

Specified by:
noteIMGSRC in interface IHTMLHandler
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

noteFRAMESRC

public void noteFRAMESRC(java.lang.String rawURL)
                  throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Note discovered FRAME SRC

Specified by:
noteFRAMESRC in interface IHTMLHandler
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException