public class DOMContentUtils extends Object
| Modifier and Type | Class and Description |
|---|---|
| static class | DOMContentUtils.LinkParams |
| Constructor and Description |
|---|
| DOMContentUtils(org.apache.hadoop.conf.Configuration conf) |
| Modifier and Type | Method and Description |
|---|---|
| URL | getBase(Node node): If the Node contains a BASE tag, its HREF is returned. |
| void | getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node) |
| void | getText(StringBuffer sb, Node node): This is a convenience method, equivalent to getText(sb, node, false). |
| boolean | getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors): This method takes a StringBuffer and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuffer. |
| boolean | getTitle(StringBuffer sb, Node node): This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer. |
| void | setConf(org.apache.hadoop.conf.Configuration conf) |
public DOMContentUtils(org.apache.hadoop.conf.Configuration conf)
public void setConf(org.apache.hadoop.conf.Configuration conf)
public boolean getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors)
This method takes a StringBuffer and a DOM Node,
and will append all the content text found beneath the DOM node to
the StringBuffer.
If abortOnNestedAnchors is true, DOM traversal will
be aborted and the StringBuffer will not contain
any text encountered after a nested anchor is found.
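The abort behaviour described above can be illustrated with a small, self-contained sketch. This is not the Nutch implementation: `TextWalkSketch`, its `parse` helper, and the whitespace normalisation are hypothetical stand-ins built only on the JDK's DOM classes. It also assumes that the returned boolean reports whether the walk was aborted, which the description above does not state.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical stand-in for the behaviour described above; not the Nutch source.
class TextWalkSketch {

    // Appends the text beneath node to sb.
    // Assumption: returns true if a nested anchor aborted the traversal.
    static boolean getText(StringBuffer sb, Node node, boolean abortOnNestedAnchors) {
        return walk(sb, node, abortOnNestedAnchors, 0);
    }

    private static boolean walk(StringBuffer sb, Node node, boolean abortOnNested, int anchorDepth) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Collapse runs of whitespace and skip empty text nodes.
            String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(text);
            }
            return false;
        }
        if (abortOnNested && "a".equalsIgnoreCase(node.getNodeName())) {
            anchorDepth++;
            if (anchorDepth > 1) return true; // an anchor inside an anchor: stop here
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (walk(sb, child, abortOnNested, anchorDepth)) return true;
        }
        return false;
    }

    // Parses a well-formed XML/XHTML fragment and returns its root element.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Node root = parse("<a>outer <a>inner</a> tail</a>");

        StringBuffer all = new StringBuffer();
        System.out.println(getText(all, root, false) + " [" + all + "]"); // false [outer inner tail]

        StringBuffer cut = new StringBuffer();
        System.out.println(getText(cut, root, true) + " [" + cut + "]");  // true [outer]
    }
}
```

With `abortOnNestedAnchors` set to `true`, the buffer keeps only the text seen before the nested anchor, which matches the description: nothing encountered after the nested anchor is appended.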
public void getText(StringBuffer sb, Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public boolean getTitle(StringBuffer sb, Node node)
This method takes a StringBuffer and a DOM Node,
and will append the content text found beneath the first
title node to the StringBuffer.
public void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc.) are discarded, as are links that contain only a single nested link and empty text nodes (a common DOM-fixup artifact, at least with NekoHTML).
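The filtering rule above can be sketched with plain JDK classes. This is a simplified reading and not the Nutch code: `OutlinkSketch` and its helpers are hypothetical, resolved links are returned as strings instead of `Outlink` records, `java.net.URI` stands in for the `URL` base, and an anchor holding only whitespace is treated as empty.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration of the discard rule; not the Nutch source.
class OutlinkSketch {

    // Collects the resolved href of every anchor below node that survives the filter.
    static List<String> collectLinks(URI base, Node node) {
        List<String> out = new ArrayList<>();
        collect(base, node, out);
        return out;
    }

    private static void collect(URI base, Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty() && !isFixupArtifact(node)) {
                out.add(base.resolve(href).toString()); // resolve relative to the base
            }
        }
        // Keep descending, so a real link nested inside a discarded wrapper is still found.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(base, child, out);
        }
    }

    // True when the anchor has no content, or nothing except one nested anchor
    // and whitespace-only text nodes.
    private static boolean isFixupArtifact(Node anchor) {
        int nestedAnchors = 0;
        int otherContent = 0;
        for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE && c.getNodeValue().trim().isEmpty()) {
                continue; // empty text node
            }
            if (c.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(c.getNodeName())) {
                nestedAnchors++;
            } else {
                otherContent++;
            }
        }
        return otherContent == 0 && nestedAnchors <= 1;
    }

    // Parses a well-formed XML/XHTML fragment and returns its root element.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Node root = parse("<div>"
            + "<a href=\"guide.html\">Guide</a>"
            + "<a href=\"empty.html\"></a>"
            + "<a href=\"outer.html\"> <a href=\"/top.html\">Top</a></a>"
            + "</div>");
        URI base = URI.create("http://example.com/docs/index.html");
        System.out.println(collectLinks(base, root));
        // [http://example.com/docs/guide.html, http://example.com/top.html]
    }
}
```

In the sample markup, `empty.html` is dropped because the anchor has no inner structure, and `outer.html` is dropped because it wraps only a single nested link plus whitespace. The nested link itself is kept.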
Copyright © 2014 The Apache Software Foundation