java.lang.Object
org.apache.manifoldcf.core.connector.BaseConnector
org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector
public class WebcrawlerConnector
This is the Web Crawler implementation of the IRepositoryConnector interface. This connector may be superseded by one that calls out to Python, or by an entirely Python Connector Framework, depending on how the winds blow.
| Nested Class Summary | |
|---|---|
protected static class |
WebcrawlerConnector.CanonicalizationPolicies
Class representing a list of canonicalization rules |
protected static class |
WebcrawlerConnector.CanonicalizationPolicy
Class representing a URL regular expression match, for the purposes of determining canonicalization policy |
protected static class |
WebcrawlerConnector.DocumentURLFilter
This class describes the url filtering information obtained from a digested DocumentSpecification. |
protected class |
WebcrawlerConnector.FeedContextClass
|
protected class |
WebcrawlerConnector.FeedItemContextClass
|
protected class |
WebcrawlerConnector.FindHandler
This class is used to discover links in a session login context |
protected class |
WebcrawlerConnector.FindHTMLFormHandler
This class is the handler for HTML form parsing during state transitions |
protected class |
WebcrawlerConnector.FindHTMLHrefHandler
This class is the handler for HTML parsing during state transitions |
protected class |
WebcrawlerConnector.FindPreferredRedirectionHandler
This class is the handler for redirection handling during state transitions |
protected class |
WebcrawlerConnector.FindRedirectionHandler
This class is the handler for redirection parsing during state transitions |
protected static class |
WebcrawlerConnector.NameValue
Name/value class |
protected class |
WebcrawlerConnector.OuterContextClass
This class handles the outermost XML context for the feed document. |
protected class |
WebcrawlerConnector.ProcessActivityHTMLHandler
Class that describes HTML handling |
protected class |
WebcrawlerConnector.ProcessActivityLinkHandler
This class is the handler for links that get added into an IProcessActivity object. |
protected class |
WebcrawlerConnector.ProcessActivityRedirectionHandler
Class that describes redirection handling |
protected class |
WebcrawlerConnector.ProcessActivityXMLHandler
Class that describes XML handling |
protected class |
WebcrawlerConnector.RDFContextClass
|
protected class |
WebcrawlerConnector.RDFItemContextClass
|
protected class |
WebcrawlerConnector.RSSChannelContextClass
|
protected class |
WebcrawlerConnector.RSSContextClass
|
protected class |
WebcrawlerConnector.RSSItemContextClass
|
| Field Summary | |
|---|---|
static java.lang.String |
_rcsid
|
static java.lang.String |
ACTIVITY_FETCH
|
static java.lang.String |
ACTIVITY_LOGON_END
|
static java.lang.String |
ACTIVITY_LOGON_START
|
static java.lang.String |
ACTIVITY_ROBOTSPARSE
|
protected static DataCache |
cache
This is where we keep data around between the getVersions() phase and the processDocuments() phase. |
protected int |
connectionTimeoutMilliseconds
Connection timeout, milliseconds. |
protected CookieManager |
cookieManager
The cookie manager used by this instance |
protected CredentialsDescription |
credentialsDescription
The credentials description |
protected DNSManager |
dnsManager
The DNS manager currently used by this instance |
protected static java.lang.String |
FETCH_LOGIN
|
protected static java.lang.String |
FETCH_ROBOTS
|
protected static java.lang.String |
FETCH_STANDARD
|
protected java.lang.String |
from
The email address for this connector instance |
protected static java.lang.String[] |
interestingMimeTypeArray
This represents a list of the mime types that this connector knows how to extract links from. |
protected static java.util.Map |
interestingMimeTypeMap
|
protected boolean |
isInitialized
This flag is set when the instance has been initialized |
static java.lang.String |
REL_LINK
|
static java.lang.String |
REL_REDIRECT
|
protected static int |
RESULT_NO_DOCUMENT
|
protected static int |
RESULT_NO_VERSION
|
protected static int |
RESULT_RETRY_DOCUMENT
|
protected static int |
RESULT_VERSION_NEEDED
|
protected static int |
RESULTSTATUS_FALSE
|
protected static int |
RESULTSTATUS_NOTYETDETERMINED
|
protected static int |
RESULTSTATUS_TRUE
|
protected static int |
ROBOTS_ALL
|
protected static int |
ROBOTS_DATA
|
protected static int |
ROBOTS_NONE
|
protected RobotsManager |
robotsManager
The robots manager currently used by this instance |
protected int |
robotsUsage
Robots usage flag |
protected static int |
SESSIONSTATE_LOGIN
We're in 'login mode' |
protected static int |
SESSIONSTATE_NORMAL
Normal fetch of content document. |
protected int |
socketTimeoutMilliseconds
Socket timeout, milliseconds |
protected ThrottleDescription |
throttleDescription
The throttle description |
protected TrustsDescription |
trustsDescription
The trusts description |
protected static java.util.Map |
understoodProtocols
|
protected java.lang.String |
userAgent
The user-agent for this connector instance |
| Fields inherited from class org.apache.manifoldcf.core.connector.BaseConnector |
|---|
currentContext, params |
| Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector |
|---|
JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_PARTIAL |
| Constructor Summary | |
|---|---|
WebcrawlerConnector()
Constructor. |
|
| Method Summary | |
|---|---|
void |
addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
long startTime,
long endTime)
Queue "seed" documents. |
protected java.lang.String[] |
calculateDocumentEvents(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
java.lang.String documentIdentifier)
Calculate events that should be associated with a document. |
java.lang.String |
check()
Check status of connection. |
protected int |
checkFetchAllowed(java.lang.String documentIdentifier,
java.lang.String protocol,
java.lang.String hostIPAddress,
int port,
PageCredentials credential,
org.apache.manifoldcf.core.interfaces.IKeystoreManager trustStore,
java.lang.String hostName,
java.lang.String[] binNames,
long currentTime,
java.lang.String pathString,
org.apache.manifoldcf.crawler.interfaces.IVersionActivity versionActivities,
int connectionLimit)
Check robots to see if fetch is allowed. |
void |
clearThreadContext()
Clear out any state information specific to a given thread. |
protected static void |
compileList(java.util.ArrayList output,
java.util.ArrayList input)
Compile all regexp entries in the passed in list, and add them to the output list. |
void |
deinstall(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
Uninstall the connector. |
void |
disconnect()
Close the connection. |
protected java.lang.String |
doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter,
java.net.URI url)
Code to canonicalize a URL. |
protected boolean |
extractLinks(java.lang.String documentIdentifier,
org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
WebcrawlerConnector.DocumentURLFilter filter)
Code to extract links from an already-fetched document. |
protected FormData |
findHTMLForm(java.lang.String currentURI,
LoginParameters lp)
Find matching HTML form data, if present. |
protected java.lang.String |
findHTMLLinkURI(java.lang.String currentURI,
LoginParameters lp)
Find HTML link URI, if present, making sure specified preference is matched. |
protected static java.util.ArrayList |
findMetadata(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
Read a document specification to yield a map of name/value pairs for metadata |
protected java.lang.String |
findPreferredRedirectionURI(java.lang.String currentURI,
LoginParameters lp)
Find a preferred redirection URI, if it exists |
protected java.lang.String |
findRedirectionURI(java.lang.String currentURI)
Find a redirection URI, if it exists |
protected static java.lang.String[] |
getAcls(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
Grab forced acl out of document specification. |
java.lang.String[] |
getActivitiesList()
Return the list of activities that this connector supports (i.e. |
java.lang.String[] |
getBinNames(java.lang.String documentIdentifier)
Get the bin name string for a document identifier. |
int |
getConnectorModel()
Tell the world what model this connector uses for getDocumentIdentifiers(). |
java.lang.String[] |
getDocumentVersions(java.lang.String[] documentIdentifiers,
java.lang.String[] oldVersions,
org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
int jobMode,
boolean usesDefaultAuthority)
Get document versions given an array of document identifiers. |
java.lang.String |
getJSPFolder()
Return the path for the UI interface JSP elements. |
int |
getMaxDocumentRequest()
Get the maximum number of documents to amalgamate together into one batch, for this connector. |
protected PageCredentials |
getPageCredential(java.lang.String documentIdentifier)
Get the page credentials for a given document identifier (URL) |
java.lang.String[] |
getRelationshipTypes()
Return the list of relationship types that this connector recognizes. |
protected SequenceCredentials |
getSequenceCredential(java.lang.String documentIdentifier)
Get the sequence credentials for a given document identifier (URL) |
protected void |
getSession()
Start a session |
protected org.apache.manifoldcf.core.interfaces.IKeystoreManager |
getTrustStore(java.lang.String documentIdentifier)
Get the trust store for a given document identifier (URL) |
protected void |
handleHTML(java.lang.String documentURI,
IHTMLHandler handler)
Handle document references from HTML |
protected void |
handleRedirects(java.lang.String documentURI,
IRedirectionHandler handler)
Handle extracting the redirect link from a redirect response. |
protected void |
handleXML(java.lang.String documentURI,
IXMLHandler handler)
Handle document references from XML. |
void |
install(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
Install the connector. |
protected boolean |
isContentInteresting(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities,
java.lang.String documentIdentifier,
int response,
java.lang.String contentType)
Code to check if data is interesting, based on response code and content type. |
protected boolean |
isDataIngestable(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities,
java.lang.String documentIdentifier)
Code to check if an already-fetched document should be ingested. |
protected boolean |
isDocumentText(java.lang.String documentURI)
Is the document text, as far as we can tell? |
protected static boolean |
isStrange(byte x)
Check if character is not typical ASCII. |
protected static boolean |
isText(byte[] beginChunk,
int chunkLength)
Test to see if a document is text or not. |
protected static boolean |
isWhiteSpace(byte x)
Check if a byte is a whitespace character. |
protected int |
lookupIPAddress(java.lang.String documentIdentifier,
org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
java.lang.String hostName,
long currentTime,
java.lang.StringBuffer ipAddressBuffer)
Look up an IP address given a non-canonical host name. |
protected java.lang.String |
makeDNSEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
java.lang.String hostNameKey)
Calculate the event name for DNS access. |
protected java.lang.String |
makeDocumentIdentifier(java.lang.String parentIdentifier,
java.lang.String rawURL,
WebcrawlerConnector.DocumentURLFilter filter)
Convert an absolute or relative URL to a document identifier. |
protected java.lang.String |
makeRobotsEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity versionActivities,
java.lang.String robotsKey)
Construct a name for the global web-connector robots event. |
protected static java.lang.String |
makeRobotsKey(java.lang.String protocol,
java.lang.String hostName,
int port)
Construct the robots key for a host. |
protected java.lang.String |
makeSessionLoginEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
java.lang.String sequenceKey)
Calculate the event name for session login. |
void |
outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
java.lang.String tabName)
Output the configuration body section. |
void |
outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
java.util.ArrayList tabsArray)
Output the configuration header section. |
void |
outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
java.lang.String tabName)
Output the specification body section. |
void |
outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
java.util.ArrayList tabsArray)
Output the specification header section. |
protected static void |
pack(java.lang.StringBuffer output,
java.lang.String value,
char delimiter)
Stuffer for packing a single string with an end delimiter |
protected static void |
packFixedList(java.lang.StringBuffer output,
java.lang.String[] values,
char delimiter)
Stuffer for packing lists of fixed length |
protected static void |
packList(java.lang.StringBuffer output,
java.util.ArrayList values,
char delimiter)
Stuffer for packing lists of variable length |
protected static void |
packList(java.lang.StringBuffer output,
java.lang.String[] values,
char delimiter)
Another stuffer for packing lists of variable length |
void |
poll()
This method is periodically called for all connectors that are connected but not in active use. |
java.lang.String |
processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
Process a configuration post. |
void |
processDocuments(java.lang.String[] documentIdentifiers,
java.lang.String[] versions,
org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
boolean[] scanOnly)
Process a set of documents. |
java.lang.String |
processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
Process a specification post. |
void |
releaseDocumentVersions(java.lang.String[] documentIdentifiers,
java.lang.String[] versions)
Free a set of documents. |
protected static java.util.ArrayList |
stringToArray(java.lang.String input)
Read a string as a sequence of individual expressions, urls, etc. |
protected static int |
unpack(java.lang.StringBuffer sb,
java.lang.String value,
int startPosition,
char delimiter)
Unstuffer for the above. |
protected static int |
unpackFixedList(java.lang.String[] output,
java.lang.String value,
int startPosition,
char delimiter)
Unstuffer for unpacking lists of fixed length |
protected static int |
unpackList(java.util.ArrayList output,
java.lang.String value,
int startPosition,
char delimiter)
Unstuffer for unpacking lists of variable length. |
void |
viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
View configuration. |
void |
viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
View specification. |
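The pack, packList, unpack and unpackList helpers listed in the method summary above are described only as "stuffers" and "unstuffers" with an end delimiter. The following is a minimal sketch of how such delimiter-based packing can work, not the connector's actual code. The backslash escape and the count-prefixed list layout are assumptions, and the class name PackSketch is hypothetical. The documented signatures take java.lang.StringBuffer and java.util.ArrayList; the sketch uses StringBuilder and List for brevity.

```java
import java.util.List;

/** Hypothetical sketch of delimiter-based string packing; the escape scheme is an assumption. */
final class PackSketch {
    /** Append value, escaping backslashes and the delimiter, followed by the end delimiter. */
    static void pack(StringBuilder output, String value, char delimiter) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == delimiter) {
                output.append('\\');
            }
            output.append(c);
        }
        output.append(delimiter);
    }

    /** Read one packed string starting at startPosition; return the position just past its delimiter. */
    static int unpack(StringBuilder sb, String value, int startPosition, char delimiter) {
        int i = startPosition;
        while (i < value.length()) {
            char c = value.charAt(i++);
            if (c == '\\' && i < value.length()) {
                sb.append(value.charAt(i++)); // escaped character, copied literally
            } else if (c == delimiter) {
                break;                        // end of this packed string
            } else {
                sb.append(c);
            }
        }
        return i;
    }

    /** Variable-length list: pack the element count first, then each element. */
    static void packList(StringBuilder output, List<String> values, char delimiter) {
        pack(output, Integer.toString(values.size()), delimiter);
        for (String v : values) {
            pack(output, v, delimiter);
        }
    }

    /** Inverse of packList; appends the decoded elements to output and returns the next position. */
    static int unpackList(List<String> output, String value, int startPosition, char delimiter) {
        StringBuilder count = new StringBuilder();
        int pos = unpack(count, value, startPosition, delimiter);
        int n = Integer.parseInt(count.toString());
        for (int k = 0; k < n; k++) {
            StringBuilder item = new StringBuilder();
            pos = unpack(item, value, pos, delimiter);
            output.add(item.toString());
        }
        return pos;
    }
}
```

Because each unpack call returns the position just past the delimiter it consumed, several packed values can be read back in sequence from one string by feeding each return value into the next call.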
| Methods inherited from class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector |
|---|
addSeedDocuments, getDocumentIdentifiers, getDocumentIdentifiers, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getRemainingDocumentIdentifiers, processDocuments, requestInfo |
| Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector |
|---|
connect, getConfiguration, setThreadContext |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods inherited from interface org.apache.manifoldcf.core.interfaces.IConnector |
|---|
connect, getConfiguration, setThreadContext |
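The isText, isStrange and isWhiteSpace helpers in the method summary suggest a byte-level heuristic for deciding whether a fetched document is text. One plausible shape for such a heuristic is sketched below. The 5% threshold, the definition of a "strange" byte, and the class name TextSniffSketch are all assumptions made for illustration; the connector's real rules may differ.

```java
/** Hypothetical text-sniffing heuristic; thresholds and byte classes are illustrative assumptions. */
final class TextSniffSketch {
    /** Space, tab, line feed, carriage return and form feed count as whitespace. */
    static boolean isWhiteSpace(byte x) {
        return x == ' ' || x == '\t' || x == '\n' || x == '\r' || x == '\f';
    }

    /** A byte is "strange" if it is outside printable 7-bit ASCII and is not whitespace. */
    static boolean isStrange(byte x) {
        // Java bytes are signed, so values 0x80..0xFF are negative and fail the lower bound.
        return (x < 0x20 || x > 0x7e) && !isWhiteSpace(x);
    }

    /** Treat the chunk as text when at most 5% of its first chunkLength bytes are strange. */
    static boolean isText(byte[] beginChunk, int chunkLength) {
        if (chunkLength == 0) {
            return true;
        }
        int strange = 0;
        for (int i = 0; i < chunkLength; i++) {
            if (isStrange(beginChunk[i])) {
                strange++;
            }
        }
        return strange * 20 <= chunkLength;
    }
}
```

Note that this sketch counts every byte of a multi-byte UTF-8 sequence as strange, so heavily non-ASCII text would be misclassified; a production heuristic would need to account for that.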
| Field Detail |
|---|
public static final java.lang.String _rcsid
protected static final int RESULTSTATUS_FALSE
protected static final int RESULTSTATUS_TRUE
protected static final int RESULTSTATUS_NOTYETDETERMINED
protected static final java.lang.String[] interestingMimeTypeArray
protected static final java.util.Map interestingMimeTypeMap
protected static final java.util.Map understoodProtocols
protected static final int ROBOTS_NONE
protected static final int ROBOTS_DATA
protected static final int ROBOTS_ALL
public static final java.lang.String REL_LINK
public static final java.lang.String REL_REDIRECT
public static final java.lang.String ACTIVITY_FETCH
public static final java.lang.String ACTIVITY_ROBOTSPARSE
public static final java.lang.String ACTIVITY_LOGON_START
public static final java.lang.String ACTIVITY_LOGON_END
protected static final java.lang.String FETCH_ROBOTS
protected static final java.lang.String FETCH_STANDARD
protected static final java.lang.String FETCH_LOGIN
protected int robotsUsage
protected java.lang.String userAgent
protected java.lang.String from
protected int connectionTimeoutMilliseconds
protected int socketTimeoutMilliseconds
protected ThrottleDescription throttleDescription
protected CredentialsDescription credentialsDescription
protected TrustsDescription trustsDescription
protected RobotsManager robotsManager
protected DNSManager dnsManager
protected CookieManager cookieManager
protected boolean isInitialized
protected static DataCache cache
protected static final int SESSIONSTATE_NORMAL
protected static final int SESSIONSTATE_LOGIN
protected static final int RESULT_NO_DOCUMENT
protected static final int RESULT_NO_VERSION
protected static final int RESULT_VERSION_NEEDED
protected static final int RESULT_RETRY_DOCUMENT
| Constructor Detail |
|---|
public WebcrawlerConnector()
| Method Detail |
|---|
public int getConnectorModel()
getConnectorModel in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
getConnectorModel in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
public java.lang.String getJSPFolder()
public void install(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
install in interface org.apache.manifoldcf.core.interfaces.IConnector
install in class org.apache.manifoldcf.core.connector.BaseConnector
threadContext - is the current thread context.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public void deinstall(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
deinstall in interface org.apache.manifoldcf.core.interfaces.IConnector
deinstall in class org.apache.manifoldcf.core.connector.BaseConnector
threadContext - is the current thread context.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public java.lang.String[] getActivitiesList()
getActivitiesList in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
getActivitiesList in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
public java.lang.String[] getRelationshipTypes()
getRelationshipTypes in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
getRelationshipTypes in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
public void clearThreadContext()
clearThreadContext in interface org.apache.manifoldcf.core.interfaces.IConnector
clearThreadContext in class org.apache.manifoldcf.core.connector.BaseConnector
protected void getSession()
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public void poll()
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
poll in interface org.apache.manifoldcf.core.interfaces.IConnector
poll in class org.apache.manifoldcf.core.connector.BaseConnector
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public java.lang.String check()
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
check in interface org.apache.manifoldcf.core.interfaces.IConnector
check in class org.apache.manifoldcf.core.connector.BaseConnector
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public void disconnect()
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
disconnect in interface org.apache.manifoldcf.core.interfaces.IConnector
disconnect in class org.apache.manifoldcf.core.connector.BaseConnector
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public java.lang.String[] getBinNames(java.lang.String documentIdentifier)
getBinNames in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
getBinNames in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
documentIdentifier - is the document identifier.
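Since document identifiers for this connector are URLs, one natural way to derive a bin name is from the URL's host, so that documents from the same server are grouped together. The sketch below illustrates that idea only. Using the lowercased host as the single bin name, returning an empty-string bin for unparsable identifiers, and the class name BinNameSketch are assumptions, not behavior confirmed by this page.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical sketch: derive a bin name from the host part of a URL-valued document identifier. */
final class BinNameSketch {
    /** Returns one bin named after the lowercased host, or an empty-string bin if no host can be parsed. */
    static String[] getBinNames(String documentIdentifier) {
        try {
            String host = new URI(documentIdentifier).getHost();
            if (host == null) {
                return new String[] {""};
            }
            return new String[] {host.toLowerCase(Locale.ROOT)};
        } catch (URISyntaxException e) {
            // Not a parsable URL: fall back to a single catch-all bin.
            return new String[] {""};
        }
    }
}
```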
public void addSeedDocuments(org.apache.manifoldcf.crawler.interfaces.ISeedingActivity activities,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
long startTime,
long endTime)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
addSeedDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
java.lang.String[] oldVersions,
org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
int jobMode,
boolean usesDefaultAuthority)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
getDocumentVersions in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
getDocumentVersions in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
oldVersions - is the corresponding array of version strings that have been saved for the document identifiers. A null value indicates that this is a first-time fetch, while an empty string indicates that the previous document had an empty version string.
activities - is the interface this method should use to perform whatever framework actions are desired.
spec - is the current document specification for the current job. If there is a dependency on this specification, then the version string should include the pertinent data, so that reingestion will occur when the specification changes. This is primarily useful for metadata.
jobMode - is an integer describing how the job is being run, whether continuous or once-only.
usesDefaultAuthority - will be true only if the authority in use for these documents is the default one.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
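The parameter notes for getDocumentVersions state two rules: a null old version means a first-time fetch, and any specification data the document depends on should be folded into the version string so that a specification change triggers reingestion. The sketch below illustrates both rules in isolation. The class VersionSketch, its method names, the '|' separator, and the comparison step are hypothetical stand-ins; the page does not show how the framework actually compares version strings.

```java
import java.util.List;

/** Hypothetical sketch of version-string construction and comparison. */
final class VersionSketch {
    /** Build a version string from a content checksum plus the spec-derived metadata it depends on. */
    static String makeVersion(String contentChecksum, List<String> metadataPairs) {
        StringBuilder sb = new StringBuilder(contentChecksum);
        for (String pair : metadataPairs) {
            // Assumed separator; real code would need packing that stays unambiguous for any value.
            sb.append('|').append(pair);
        }
        return sb.toString();
    }

    /** A null old version means a first-time fetch; otherwise reingest only if the string changed. */
    static boolean needsIngestion(String oldVersion, String newVersion) {
        return oldVersion == null || !oldVersion.equals(newVersion);
    }
}
```

With this layout, changing a metadata value in the job specification changes the version string even when the content checksum is unchanged, which is the effect the spec parameter note asks for. An empty old version is treated like any other saved string, distinct from null.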
public void processDocuments(java.lang.String[] documentIdentifiers,
java.lang.String[] versions,
org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
boolean[] scanOnly)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
processDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
documentIdentifiers - is the set of document identifiers to process.
activities - is the interface this method should use to queue up new document references and ingest documents.
spec - is the document specification.
scanOnly - is an array corresponding to the document identifiers. It is set to true to indicate when the processing should only find other references, and should not actually call the ingestion methods.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
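The scanOnly contract described for processDocuments can be illustrated with a small loop: references are always discovered, but ingestion is skipped for entries whose flag is true. The Activity and LinkSource interfaces below are hypothetical stand-ins invented for this sketch; they are not the real IProcessActivity API, and the real method also receives a DocumentSpecification.

```java
import java.util.List;

/** Hypothetical sketch of honoring per-document scanOnly flags. */
final class ScanOnlySketch {
    /** Stand-in for the framework activity object (assumed shape, not IProcessActivity). */
    interface Activity {
        void addReference(String parentId, String childId);
        void ingest(String documentId, String version);
    }

    /** Stand-in for whatever extracts links from an already-fetched document. */
    interface LinkSource {
        List<String> linksOf(String documentId);
    }

    static void processDocuments(String[] documentIdentifiers, String[] versions,
                                 Activity activities, LinkSource links, boolean[] scanOnly) {
        for (int i = 0; i < documentIdentifiers.length; i++) {
            String id = documentIdentifiers[i];
            // References are queued regardless of the scanOnly flag.
            for (String child : links.linksOf(id)) {
                activities.addReference(id, child);
            }
            // Ingestion happens only when scanOnly is false for this document.
            if (!scanOnly[i]) {
                activities.ingest(id, versions[i]);
            }
        }
    }
}
```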
public void releaseDocumentVersions(java.lang.String[] documentIdentifiers,
java.lang.String[] versions)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
releaseDocumentVersions in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
releaseDocumentVersions in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
documentIdentifiers - is the set of document identifiers.
versions - is the corresponding set of version identifiers (individual identifiers may be null).
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public int getMaxDocumentRequest()
getMaxDocumentRequest in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
getMaxDocumentRequest in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
public void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
java.util.ArrayList tabsArray)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
java.io.IOException
outputConfigurationHeader in interface org.apache.manifoldcf.core.interfaces.IConnector
outputConfigurationHeader in class org.apache.manifoldcf.core.connector.BaseConnector
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
public void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
java.lang.String tabName)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
java.io.IOException
public java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
processConfigurationPost in interface org.apache.manifoldcf.core.interfaces.IConnector
processConfigurationPost in class org.apache.manifoldcf.core.connector.BaseConnector
threadContext - is the local thread context.
variableContext - is the set of variables available from the post, including binary file post information.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
java.io.IOException
viewConfiguration in interface org.apache.manifoldcf.core.interfaces.IConnector
viewConfiguration in class org.apache.manifoldcf.core.connector.BaseConnector
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
public void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
java.util.ArrayList tabsArray)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
java.io.IOException
outputSpecificationHeader in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
outputSpecificationHeader in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
public void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
java.lang.String tabName)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
java.io.IOException
public java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
processSpecificationPost in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
processSpecificationPost in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
variableContext - contains the post data, including binary file-upload information.
ds - is the current document specification for this job.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
public void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
java.io.IOException
viewSpecification in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
viewSpecification in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException
protected java.lang.String makeSessionLoginEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
java.lang.String sequenceKey)
protected java.lang.String makeDNSEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
java.lang.String hostNameKey)
protected int lookupIPAddress(java.lang.String documentIdentifier,
org.apache.manifoldcf.crawler.interfaces.IVersionActivity activities,
java.lang.String hostName,
long currentTime,
java.lang.StringBuffer ipAddressBuffer)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
protected static java.lang.String makeRobotsKey(java.lang.String protocol,
java.lang.String hostName,
int port)
protected java.lang.String makeRobotsEventName(org.apache.manifoldcf.crawler.interfaces.INamingActivity versionActivities,
java.lang.String robotsKey)
protected int checkFetchAllowed(java.lang.String documentIdentifier,
java.lang.String protocol,
java.lang.String hostIPAddress,
int port,
PageCredentials credential,
org.apache.manifoldcf.core.interfaces.IKeystoreManager trustStore,
java.lang.String hostName,
java.lang.String[] binNames,
long currentTime,
java.lang.String pathString,
org.apache.manifoldcf.crawler.interfaces.IVersionActivity versionActivities,
int connectionLimit)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
protected java.lang.String makeDocumentIdentifier(java.lang.String parentIdentifier,
java.lang.String rawURL,
WebcrawlerConnector.DocumentURLFilter filter)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
parentIdentifier - the identifier of the document in which the raw url was found, or null if none.
rawURL - the starting, un-normalized, un-canonicalized URL.
filter - the filter object, used to remove unmatching URLs.
org.apache.manifoldcf.core.interfaces.ManifoldCFException
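The parameter descriptions for makeDocumentIdentifier suggest a resolve-then-filter pipeline: the raw URL is resolved against its parent, normalized, and discarded if the filter rejects it. The following is a minimal sketch of that idea using only java.net.URI. The class name DocumentIdentifierSketch, the Predicate<String> stand-in for WebcrawlerConnector.DocumentURLFilter, and the decisions to strip fragments and to return null for unusable URLs are assumptions made for illustration; the connector's actual logic is not shown in this page.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.function.Predicate;

// Hypothetical illustration only; not part of ManifoldCF.
final class DocumentIdentifierSketch {

  /**
   * Resolves rawURL against parentIdentifier (when present), normalizes the
   * result, drops any fragment, and applies the filter.
   * Returns null when the URL is malformed, not hierarchical, or filtered out.
   */
  static String makeDocumentIdentifier(String parentIdentifier, String rawURL,
                                       Predicate<String> filter) {
    try {
      URI url = new URI(rawURL);
      if (parentIdentifier != null) {
        // Relative references such as "../b.html" are resolved against the parent.
        url = new URI(parentIdentifier).resolve(url);
      }
      // Assumption: only absolute URLs with a host are crawlable.
      if (!url.isAbsolute() || url.getHost() == null) {
        return null;
      }
      String id = url.normalize().toString();
      // In a valid URI, '#' can only introduce the fragment, so cut there.
      int hash = id.indexOf('#');
      if (hash >= 0) {
        id = id.substring(0, hash);
      }
      return filter.test(id) ? id : null;
    } catch (URISyntaxException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    String id = makeDocumentIdentifier(
        "http://example.com/a/c/index.html", "../b.html#top", u -> true);
    System.out.println(id);
  }
}
```

Returning null rather than throwing keeps the caller's link-extraction loop simple in this sketch; the real method declares ManifoldCFException, so it may report some failures differently.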
protected java.lang.String doCanonicalization(WebcrawlerConnector.DocumentURLFilter filter,
java.net.URI url)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
java.net.URISyntaxException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.net.URISyntaxException
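This page does not describe which canonicalization rules doCanonicalization applies. As a hedged illustration of what URL canonicalization typically involves, the sketch below applies two common rules to a hierarchical URI: removing a Java session identifier path parameter and sorting query arguments. The class CanonicalizationSketch, its boolean flags, and the choice of these two rules are assumptions, not the connector's documented behavior.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;

// Hypothetical illustration only; not part of ManifoldCF.
final class CanonicalizationSketch {

  /**
   * Returns a canonical string form of a hierarchical URI.
   * The fragment is always dropped; the two flags enable the optional rules.
   */
  static String canonicalize(URI url, boolean reorderQuery, boolean removeJavaSession)
      throws URISyntaxException {
    String path = url.getRawPath();
    String query = url.getRawQuery();

    if (removeJavaSession && path != null) {
      int idx = path.indexOf(";jsessionid=");
      if (idx >= 0) {
        // Remove the parameter up to the next ';' or '/' (or end of path).
        int end = idx + 1;
        while (end < path.length()
            && path.charAt(end) != ';' && path.charAt(end) != '/') {
          end++;
        }
        path = path.substring(0, idx) + path.substring(end);
      }
    }

    if (reorderQuery && query != null) {
      // Sorting makes "?b=2&a=1" and "?a=1&b=2" map to the same identifier.
      String[] args = query.split("&");
      Arrays.sort(args);
      query = String.join("&", args);
    }

    StringBuilder sb = new StringBuilder();
    if (url.getScheme() != null) sb.append(url.getScheme()).append(':');
    if (url.getRawAuthority() != null) sb.append("//").append(url.getRawAuthority());
    if (path != null) sb.append(path);
    if (query != null) sb.append('?').append(query);
    // Re-parse to confirm the rebuilt string is still a valid URI.
    return new URI(sb.toString()).toString();
  }

  public static void main(String[] args) throws URISyntaxException {
    URI u = new URI("http://example.com/app/page.jsp;jsessionid=ABC123?b=2&a=1");
    System.out.println(canonicalize(u, true, true));
  }
}
```

The raw (still percent-encoded) components are used throughout so that escaped characters such as %26 are not decoded and re-encoded differently.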
protected boolean isContentInteresting(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities,
java.lang.String documentIdentifier,
int response,
java.lang.String contentType)
throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected boolean isDataIngestable(org.apache.manifoldcf.crawler.interfaces.IFingerprintActivity activities,
java.lang.String documentIdentifier)
throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected java.lang.String findRedirectionURI(java.lang.String currentURI)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected FormData findHTMLForm(java.lang.String currentURI,
LoginParameters lp)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected java.lang.String findPreferredRedirectionURI(java.lang.String currentURI,
LoginParameters lp)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected java.lang.String findHTMLLinkURI(java.lang.String currentURI,
LoginParameters lp)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected boolean extractLinks(java.lang.String documentIdentifier,
org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
WebcrawlerConnector.DocumentURLFilter filter)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
protected void handleRedirects(java.lang.String documentURI,
IRedirectionHandler handler)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected void handleXML(java.lang.String documentURI,
IXMLHandler handler)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
protected void handleHTML(java.lang.String documentURI,
IHTMLHandler handler)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected boolean isDocumentText(java.lang.String documentURI)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected static boolean isText(byte[] beginChunk,
int chunkLength)
protected static boolean isStrange(byte x)
protected static boolean isWhiteSpace(byte x)
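The three signatures isText, isStrange and isWhiteSpace point to a byte-level heuristic for deciding whether the start of a document looks like text. The sketch below shows one way such a heuristic could work. The class TextSniffSketch, the 30% threshold, and the definition of a "strange" byte as a non-whitespace control character are assumptions for illustration; the connector's real rules are not given in this page.

```java
// Hypothetical illustration only; not part of ManifoldCF.
final class TextSniffSketch {

  // Assumed cutoff: more than 30% strange bytes means "probably binary".
  private static final double MAX_STRANGE_FRACTION = 0.30;

  /** Tab, line feed, carriage return, or space. */
  static boolean isWhiteSpace(byte x) {
    return x == 0x09 || x == 0x0a || x == 0x0d || x == 0x20;
  }

  /**
   * A byte is "strange" here if it is a control character other than
   * whitespace. Java bytes are signed, so mask to get the unsigned value;
   * bytes 0x80-0xFF are not counted because they may belong to multi-byte text.
   */
  static boolean isStrange(byte x) {
    int v = x & 0xff;
    return (v < 0x20 || v == 0x7f) && !isWhiteSpace(x);
  }

  /** Examines the first chunkLength bytes of beginChunk. */
  static boolean isText(byte[] beginChunk, int chunkLength) {
    int n = Math.min(chunkLength, beginChunk.length);
    if (n <= 0) {
      return true; // nothing to inspect; treat empty content as text
    }
    int strange = 0;
    for (int i = 0; i < n; i++) {
      if (isStrange(beginChunk[i])) {
        strange++;
      }
    }
    return ((double) strange) / n <= MAX_STRANGE_FRACTION;
  }

  public static void main(String[] args) {
    byte[] sample = "Hello, world\n".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    System.out.println(isText(sample, sample.length));
  }
}
```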
protected static java.util.ArrayList stringToArray(java.lang.String input)
protected static void compileList(java.util.ArrayList output,
java.util.ArrayList input)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected PageCredentials getPageCredential(java.lang.String documentIdentifier)
protected SequenceCredentials getSequenceCredential(java.lang.String documentIdentifier)
protected org.apache.manifoldcf.core.interfaces.IKeystoreManager getTrustStore(java.lang.String documentIdentifier)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected static java.lang.String[] getAcls(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
spec - is the document specification.
protected static java.util.ArrayList findMetadata(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
protected static void pack(java.lang.StringBuffer output,
java.lang.String value,
char delimiter)
protected static int unpack(java.lang.StringBuffer sb,
java.lang.String value,
int startPosition,
char delimiter)
protected static void packFixedList(java.lang.StringBuffer output,
java.lang.String[] values,
char delimiter)
protected static int unpackFixedList(java.lang.String[] output,
java.lang.String value,
int startPosition,
char delimiter)
protected static void packList(java.lang.StringBuffer output,
java.util.ArrayList values,
char delimiter)
protected static void packList(java.lang.StringBuffer output,
java.lang.String[] values,
char delimiter)
protected static int unpackList(java.util.ArrayList output,
java.lang.String value,
int startPosition,
char delimiter)
output - is the array into which the unpacked output is written.
value - is the value to unpack.
startPosition - is the place to start the unpack.
delimiter - is the character to use between values.
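The pack, unpack, packList and unpackList signatures describe a delimiter-based string encoding in which each unpack call returns the position where the next read should start. The sketch below shows one self-consistent way to implement that contract. The backslash escaping scheme, the count prefix written by packList, the class name PackSketch, and the use of List<String> in place of the raw java.util.ArrayList are assumptions for illustration and may differ from the connector's actual encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of ManifoldCF.
// Assumes the delimiter is never the backslash character.
final class PackSketch {

  /** Appends value, escaping backslashes and delimiters, then the delimiter. */
  static void pack(StringBuffer output, String value, char delimiter) {
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      if (c == '\\' || c == delimiter) {
        output.append('\\');
      }
      output.append(c);
    }
    output.append(delimiter);
  }

  /** Reads one packed value into sb; returns the index just past its delimiter. */
  static int unpack(StringBuffer sb, String value, int startPosition, char delimiter) {
    int pos = startPosition;
    while (pos < value.length()) {
      char c = value.charAt(pos++);
      if (c == '\\') {
        // Escaped character: copy the next character literally.
        if (pos < value.length()) {
          sb.append(value.charAt(pos++));
        }
      } else if (c == delimiter) {
        break;
      } else {
        sb.append(c);
      }
    }
    return pos;
  }

  /** Writes the element count first, then each element. */
  static void packList(StringBuffer output, List<String> values, char delimiter) {
    pack(output, Integer.toString(values.size()), delimiter);
    for (String v : values) {
      pack(output, v, delimiter);
    }
  }

  /** Reads the count, then that many elements; returns the next read position. */
  static int unpackList(List<String> output, String value, int startPosition, char delimiter) {
    StringBuffer sb = new StringBuffer();
    int pos = unpack(sb, value, startPosition, delimiter);
    int count = Integer.parseInt(sb.toString());
    for (int i = 0; i < count; i++) {
      sb.setLength(0);
      pos = unpack(sb, value, pos, delimiter);
      output.add(sb.toString());
    }
    return pos;
  }

  public static void main(String[] args) {
    StringBuffer packed = new StringBuffer();
    packList(packed, Arrays.asList("a+b", "c\\d", ""), '+');
    List<String> roundTrip = new ArrayList<>();
    unpackList(roundTrip, packed.toString(), 0, '+');
    System.out.println(packed + " -> " + roundTrip);
  }
}
```

Because every unpack call returns the next start position, several packed values and lists can be concatenated in one string and read back in sequence.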
protected java.lang.String[] calculateDocumentEvents(org.apache.manifoldcf.crawler.interfaces.INamingActivity activities,
java.lang.String documentIdentifier)