org.apache.manifoldcf.crawler.connectors.filesystem
Class FileConnector

java.lang.Object
  extended by org.apache.manifoldcf.core.connector.BaseConnector
      extended by org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
          extended by org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector
All Implemented Interfaces:
org.apache.manifoldcf.core.interfaces.IConnector, org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector

public class FileConnector
extends org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector

This is the "repository connector" for a file system. It is a relative of the share crawler and should offer comparable basic functionality, except that it cannot use Active Directory or look at other shares.


Nested Class Summary
protected static class FileConnector.IdentifierStream
          Document identifier stream.
 
Field Summary
static java.lang.String _rcsid
           
protected static java.lang.String[] activitiesList
           
protected static java.lang.String ACTIVITY_READ
           
protected static java.lang.String RELATIONSHIP_CHILD
           
 
Fields inherited from class org.apache.manifoldcf.core.connector.BaseConnector
currentContext, params
 
Fields inherited from interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
JOBMODE_CONTINUOUS, JOBMODE_ONCEONLY, MODEL_ADD, MODEL_ADD_CHANGE, MODEL_ADD_CHANGE_DELETE, MODEL_ALL, MODEL_PARTIAL
 
Constructor Summary
FileConnector()
          Constructor.
 
Method Summary
protected static boolean checkInclude(java.io.File file, java.lang.String fileName, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification documentSpecification)
          Check if a file or directory should be included, given a document specification.
protected static boolean checkIngest(java.io.File file, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification documentSpecification)
          Check if a file should be ingested, given a document specification.
protected static boolean checkMatch(java.lang.String sourceMatch, int sourceIndex, java.lang.String match)
          Check a match between two strings with wildcards.
protected  java.lang.String convertToURI(java.lang.String documentIdentifier)
          Convert a document identifier to a URI.
 java.lang.String[] getActivitiesList()
          List the activities we might report on.
 java.lang.String[] getBinNames(java.lang.String documentIdentifier)
          For any given document, list the bins that it is a member of.
 org.apache.manifoldcf.crawler.interfaces.IDocumentIdentifierStream getDocumentIdentifiers(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, long startTime, long endTime)
          Given a document specification, get either a list of starting document identifiers (seeds), or a list of changes (deltas), depending on whether this is a "crawled" connector or not.
 java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
          Get document versions given an array of document identifiers.
 java.lang.String getJSPFolder()
          Return the path for the UI interface JSP elements.
 java.lang.String[] getRelationshipTypes()
          Return the list of relationship types that this connector recognizes.
protected static int matchSubPath(java.lang.String subPath, java.lang.String fullPath)
          Match a sub-path.
 void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.lang.String tabName)
          Output the configuration body section.
 void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters, java.util.ArrayList tabsArray)
          Output the configuration header section.
 void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds, java.lang.String tabName)
          Output the specification body section.
 void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds, java.util.ArrayList tabsArray)
          Output the specification header section.
protected static boolean processCheck(boolean caseSensitive, java.lang.String sourceMatch, int sourceIndex, java.lang.String match, int matchIndex)
          Recursive worker method for checkMatch.
 java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
          Process a configuration post.
 void processDocuments(java.lang.String[] documentIdentifiers, java.lang.String[] versions, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec, boolean[] scanOnly)
          Process a set of documents.
 java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
          Process a specification post.
 void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
          View configuration.
 void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out, org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
          View specification.
 
Methods inherited from class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
addSeedDocuments, addSeedDocuments, getConnectorModel, getDocumentIdentifiers, getDocumentVersions, getDocumentVersions, getDocumentVersions, getDocumentVersions, getMaxDocumentRequest, getRemainingDocumentIdentifiers, processDocuments, releaseDocumentVersions, requestInfo
 
Methods inherited from class org.apache.manifoldcf.core.connector.BaseConnector
check, clearThreadContext, connect, deinstall, disconnect, getConfiguration, install, poll, setThreadContext
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.manifoldcf.core.interfaces.IConnector
check, clearThreadContext, connect, deinstall, disconnect, getConfiguration, install, poll, setThreadContext
 

Field Detail

_rcsid

public static final java.lang.String _rcsid
See Also:
Constant Field Values

ACTIVITY_READ

protected static final java.lang.String ACTIVITY_READ
See Also:
Constant Field Values

RELATIONSHIP_CHILD

protected static final java.lang.String RELATIONSHIP_CHILD
See Also:
Constant Field Values

activitiesList

protected static final java.lang.String[] activitiesList
Constructor Detail

FileConnector

public FileConnector()
Constructor.

Method Detail

getJSPFolder

public java.lang.String getJSPFolder()
Return the path for the UI interface JSP elements. These JSPs must be provided to allow the connector to be configured, and to permit it to present document filtering specification information in the UI. This method should return the name of the folder, under the /connectors/ area, where the appropriate JSPs can be found. The name should NOT have a slash in it.

Returns:
the folder part

getRelationshipTypes

public java.lang.String[] getRelationshipTypes()
Return the list of relationship types that this connector recognizes.

Specified by:
getRelationshipTypes in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getRelationshipTypes in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Returns:
the list.

getActivitiesList

public java.lang.String[] getActivitiesList()
List the activities we might report on.

Specified by:
getActivitiesList in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getActivitiesList in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector

getBinNames

public java.lang.String[] getBinNames(java.lang.String documentIdentifier)
For any given document, list the bins that it is a member of. For the file system, this would typically be just a blank value, but because this connector is also used for testing, it returns TWO values for each document, so that tests can be set up to see how the scheduler behaves under those conditions.

Specified by:
getBinNames in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
getBinNames in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector

convertToURI

protected java.lang.String convertToURI(java.lang.String documentIdentifier)
                                 throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Convert a document identifier to a URI. The URI serves as the unique key in the search index and is presented to the user as part of the search results.

Parameters:
documentIdentifier - is the document identifier.
Returns:
the document uri.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
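
The conversion described above can be illustrated with a minimal, standalone sketch. It assumes that a document identifier is a local file path, which this page does not state; the class name UriSketch is hypothetical, and the connector's actual implementation may differ.

```java
import java.io.File;

final class UriSketch {
    /** Hypothetical stand-in for convertToURI: treats the identifier as a file path (an assumption). */
    static String convertToURI(String documentIdentifier) {
        // File.toURI() yields an absolute "file:" URI and quotes illegal
        // characters such as spaces; toASCIIString() also encodes non-ASCII characters.
        return new File(documentIdentifier).toURI().toASCIIString();
    }

    public static void main(String[] args) {
        System.out.println(convertToURI("/tmp/annual report.txt"));
    }
}
```

Because the result is used as a unique key, the sketch relies on the URI being absolute, so that the same file always maps to the same string.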

getDocumentIdentifiers

public org.apache.manifoldcf.crawler.interfaces.IDocumentIdentifierStream getDocumentIdentifiers(org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                                                                                                 long startTime,
                                                                                                 long endTime)
                                                                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Given a document specification, get either a list of starting document identifiers (seeds), or a list of changes (deltas), depending on whether this is a "crawled" connector or not. These document identifiers will be loaded into the job's queue at the beginning of the job's execution. This method can return changes only (because it is provided a time range). For full recrawls, the start time is always zero. Note that it is always ok to return MORE documents rather than fewer with this method.

Overrides:
getDocumentIdentifiers in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
spec - is a document specification (that comes from the job).
startTime - is the beginning of the time range to consider, inclusive.
endTime - is the end of the time range to consider, exclusive.
Returns:
the stream of local document identifiers that should be added to the queue.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
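
The time-range contract (start inclusive, end exclusive, start of zero for a full recrawl) can be sketched in isolation. The example below covers only the "changes (deltas)" case and works on an in-memory map of modification times rather than a DocumentSpecification; the class and method names are hypothetical and are not part of the connector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SeedWindowSketch {
    /**
     * Returns the identifiers whose modification time lies in [startTime, endTime).
     * With startTime == 0 every document modified before endTime is returned.
     */
    static List<String> identifiersInWindow(Map<String, Long> lastModifiedById,
                                            long startTime, long endTime) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastModifiedById.entrySet()) {
            long modified = entry.getValue();
            // startTime is inclusive, endTime is exclusive.
            if (modified >= startTime && modified < endTime) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Returning extra identifiers is harmless under the contract above, so an implementation that cannot filter by time may simply return every candidate.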

getDocumentVersions

public java.lang.String[] getDocumentVersions(java.lang.String[] documentIdentifiers,
                                              org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec)
                                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                              org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Get document versions given an array of document identifiers. This method is called for EVERY document that is considered. It is therefore important to perform as little work as possible here.

Overrides:
getDocumentVersions in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifiers - is the array of local document identifiers, as understood by this connector.
Returns:
the corresponding version strings, with null in the places where the document no longer exists. Empty version strings indicate that there is no versioning ability for the corresponding document, and the document will always be processed.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
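
The return contract (null for a missing document, an empty string when no versioning is possible, otherwise a version string) can be sketched as follows. Building the version from file length and modification time, and treating directories as unversioned, are assumptions made for illustration; this page does not say how the connector derives its version strings. The document specification parameter is omitted.

```java
import java.io.File;

final class VersionSketch {
    /** Cheap version string: changes whenever the size or the timestamp changes (an assumed scheme). */
    static String versionOf(long length, long lastModified) {
        return length + ":" + lastModified;
    }

    static String[] getDocumentVersions(String[] documentIdentifiers) {
        String[] versions = new String[documentIdentifiers.length];
        for (int i = 0; i < documentIdentifiers.length; i++) {
            File file = new File(documentIdentifiers[i]);
            if (!file.exists()) {
                versions[i] = null;            // document no longer exists
            } else if (file.isDirectory()) {
                versions[i] = "";              // no versioning: always processed
            } else {
                versions[i] = versionOf(file.length(), file.lastModified());
            }
        }
        return versions;
    }
}
```

The sketch reads only file metadata and never opens a file, in keeping with the advice to do as little work as possible in this method.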

processDocuments

public void processDocuments(java.lang.String[] documentIdentifiers,
                             java.lang.String[] versions,
                             org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                             org.apache.manifoldcf.crawler.interfaces.DocumentSpecification spec,
                             boolean[] scanOnly)
                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                             org.apache.manifoldcf.agents.interfaces.ServiceInterruption
Process a set of documents. This method should cause each document to be fetched and processed, with the results added to the queue of documents for the current job, entered into the incremental ingestion manager, or both. The document specification allows this class to filter what is done based on the job.

Overrides:
processDocuments in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
documentIdentifiers - is the set of document identifiers to process.
activities - is the interface this method should use to queue up new document references and ingest documents.
spec - is the document specification.
scanOnly - is an array corresponding to the document identifiers. It is set to true to indicate when the processing should only find other references, and should not actually call the ingestion methods.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
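
The role of scanOnly can be shown with a small in-memory sketch. The Activities class below is a hypothetical stand-in for IProcessActivity that only records calls, and the directory tree is a plain map rather than a real file system; neither reflects the actual ManifoldCF types. The versions and spec parameters are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ProcessSketch {
    /** Hypothetical stand-in for the activities interface: records what was queued and ingested. */
    static final class Activities {
        final List<String> queued = new ArrayList<>();
        final List<String> ingested = new ArrayList<>();
    }

    /**
     * tree maps a directory identifier to its children; identifiers that are
     * absent from the map are treated as plain files.
     */
    static void processDocuments(String[] documentIdentifiers,
                                 Map<String, List<String>> tree,
                                 Activities activities,
                                 boolean[] scanOnly) {
        for (int i = 0; i < documentIdentifiers.length; i++) {
            String id = documentIdentifiers[i];
            List<String> children = tree.get(id);
            if (children != null) {
                // Directory: nothing to ingest, only new references to queue.
                activities.queued.addAll(children);
            } else if (!scanOnly[i]) {
                // File: ingest only when this entry is not scan-only.
                activities.ingested.add(id);
            }
        }
    }
}
```

In this sketch a scan-only file produces no activity at all, because a plain file has no further references to discover.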

outputConfigurationHeader

public void outputConfigurationHeader(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                      org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                      org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
                                      java.util.ArrayList tabsArray)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      java.io.IOException
Output the configuration header section. This method is called in the head section of the connector's configuration page. Its purpose is to add the required tabs to the list, and to output any JavaScript methods that might be needed by the configuration editing HTML.

Specified by:
outputConfigurationHeader in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
outputConfigurationHeader in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

outputConfigurationBody

public void outputConfigurationBody(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                    org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                    org.apache.manifoldcf.core.interfaces.ConfigParams parameters,
                                    java.lang.String tabName)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                    java.io.IOException
Output the configuration body section. This method is called in the body section of the connector's configuration page. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editconnection".

Specified by:
outputConfigurationBody in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
outputConfigurationBody in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
tabName - is the current tab name.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

processConfigurationPost

public java.lang.String processConfigurationPost(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                                 org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
                                                 org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a configuration post. This method is called at the start of the connector's configuration page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the configuration parameters accordingly. The name of the posted form is "editconnection".

Specified by:
processConfigurationPost in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
processConfigurationPost in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
variableContext - is the set of variables available from the post, including binary file post information.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
Returns:
null if all is well, or a string error message if there is an error that should prevent saving of the connection (and cause a redirection to an error page).
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
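
The return convention (null on success, an error message otherwise) can be illustrated with plain maps standing in for IPostParameters and ConfigParams. The "maxdepth" field is purely hypothetical and is not a parameter of this connector; the sketch only shows the gather, validate, and update pattern described above.

```java
import java.util.Map;

final class ConfigPostSketch {
    /**
     * Copies a posted value into the configuration after validating it.
     * Returns null if all is well, or an error message that should block the save.
     */
    static String processConfigurationPost(Map<String, String> posted,
                                           Map<String, String> parameters) {
        String value = posted.get("maxdepth");   // hypothetical form field
        if (value == null) {
            return null;                          // nothing was posted for this field
        }
        String trimmed = value.trim();
        try {
            if (Integer.parseInt(trimmed) < 0) {
                return "Maximum depth must not be negative";
            }
        } catch (NumberFormatException e) {
            return "Maximum depth must be an integer";
        }
        parameters.put("maxdepth", trimmed);      // modify the configuration in place
        return null;
    }
}
```

The configuration is left unchanged whenever an error message is returned, so a rejected post cannot leave it half-updated.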

viewConfiguration

public void viewConfiguration(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                              org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                              org.apache.manifoldcf.core.interfaces.ConfigParams parameters)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                              java.io.IOException
View configuration. This method is called in the body section of the connector's view configuration page. Its purpose is to present the connection information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.

Specified by:
viewConfiguration in interface org.apache.manifoldcf.core.interfaces.IConnector
Overrides:
viewConfiguration in class org.apache.manifoldcf.core.connector.BaseConnector
Parameters:
threadContext - is the local thread context.
out - is the output to which any HTML should be sent.
parameters - are the configuration parameters, as they currently exist, for this connection being configured.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

outputSpecificationHeader

public void outputSpecificationHeader(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                      org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
                                      java.util.ArrayList tabsArray)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                      java.io.IOException
Output the specification header section. This method is called in the head section of a job page which has selected a repository connection of the current type. Its purpose is to add the required tabs to the list, and to output any JavaScript methods that might be needed by the job editing HTML.

Specified by:
outputSpecificationHeader in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
outputSpecificationHeader in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabsArray - is an array of tab names. Add to this array any tab names that are specific to the connector.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

outputSpecificationBody

public void outputSpecificationBody(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                                    org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds,
                                    java.lang.String tabName)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                    java.io.IOException
Output the specification body section. This method is called in the body section of a job page which has selected a repository connection of the current type. Its purpose is to present the required form elements for editing. The coder can presume that the HTML that is output from this configuration will be within appropriate <html>, <body>, and <form> tags. The name of the form is "editjob".

Specified by:
outputSpecificationBody in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
outputSpecificationBody in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
tabName - is the current tab name.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

processSpecificationPost

public java.lang.String processSpecificationPost(org.apache.manifoldcf.core.interfaces.IPostParameters variableContext,
                                                 org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
                                          throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Process a specification post. This method is called at the start of a job's edit or view page, whenever there is a possibility that form data for a connection has been posted. Its purpose is to gather form information and modify the document specification accordingly. The name of the posted form is "editjob".

Specified by:
processSpecificationPost in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
processSpecificationPost in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
variableContext - contains the post data, including binary file-upload information.
ds - is the current document specification for this job.
Returns:
null if all is well, or a string error message if there is an error that should prevent saving of the job (and cause a redirection to an error page).
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

viewSpecification

public void viewSpecification(org.apache.manifoldcf.core.interfaces.IHTTPOutput out,
                              org.apache.manifoldcf.crawler.interfaces.DocumentSpecification ds)
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                              java.io.IOException
View specification. This method is called in the body section of a job's view page. Its purpose is to present the document specification information to the user. The coder can presume that the HTML that is output from this configuration will be within appropriate <html> and <body> tags.

Specified by:
viewSpecification in interface org.apache.manifoldcf.crawler.interfaces.IRepositoryConnector
Overrides:
viewSpecification in class org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector
Parameters:
out - is the output to which any HTML should be sent.
ds - is the current document specification for this job.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException
java.io.IOException

checkInclude

protected static boolean checkInclude(java.io.File file,
                                      java.lang.String fileName,
                                      org.apache.manifoldcf.crawler.interfaces.DocumentSpecification documentSpecification)
                               throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check if a file or directory should be included, given a document specification.

Parameters:
fileName - is the canonical file name.
documentSpecification - is the specification.
Returns:
true if it should be included.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

checkIngest

protected static boolean checkIngest(java.io.File file,
                                     org.apache.manifoldcf.crawler.interfaces.DocumentSpecification documentSpecification)
                              throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check if a file should be ingested, given a document specification. It is presumed that only documents that pass checkInclude() will be checked with this method.

Parameters:
file - is the file.
documentSpecification - is the specification.
Throws:
org.apache.manifoldcf.core.interfaces.ManifoldCFException

matchSubPath

protected static int matchSubPath(java.lang.String subPath,
                                  java.lang.String fullPath)
Match a sub-path. The sub-path must match the complete starting part of the full path, in a path sense. The returned value should point into the file name beyond the end of the matched path, or be -1 if there is no match.

Parameters:
subPath - is the sub path.
fullPath - is the full path.
Returns:
the index of the start of the remaining part of the full path, or -1.
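
One possible reading of "match ... in a path sense" is sketched below: the sub-path must end on a separator boundary, so "/data" matches "/data/a.txt" but not "/database/a.txt". This is an interpretation, not the connector's code, and the fixed '/' separator is an assumption made to keep the example platform-independent.

```java
final class SubPathSketch {
    private static final char SEP = '/';   // assumed separator; real code would likely use File.separatorChar

    /** Returns the index just past the matched sub-path (and its separator), or -1 if there is no match. */
    static int matchSubPath(String subPath, String fullPath) {
        if (!fullPath.startsWith(subPath)) {
            return -1;
        }
        int end = subPath.length();
        if (end == fullPath.length()) {
            return end;                     // the two paths are identical
        }
        if (end > 0 && subPath.charAt(end - 1) == SEP) {
            return end;                     // sub-path already ends on a boundary
        }
        if (fullPath.charAt(end) == SEP) {
            return end + 1;                 // skip the separator after the sub-path
        }
        return -1;                          // prefix ends in the middle of a path element
    }

    public static void main(String[] args) {
        String full = "/data/a.txt";
        int index = matchSubPath("/data", full);
        System.out.println(index >= 0 ? full.substring(index) : "no match");
    }
}
```

The returned index can be passed directly to a wildcard matcher as the starting point within the full path.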

checkMatch

protected static boolean checkMatch(java.lang.String sourceMatch,
                                    int sourceIndex,
                                    java.lang.String match)
Check a match between two strings with wildcards.

Parameters:
sourceMatch - is the expanded string (no wildcards)
sourceIndex - is the starting point in the expanded string.
match - is the wildcard-based string.
Returns:
true if there is a match.

processCheck

protected static boolean processCheck(boolean caseSensitive,
                                      java.lang.String sourceMatch,
                                      int sourceIndex,
                                      java.lang.String match,
                                      int matchIndex)
Recursive worker method for checkMatch. Returns 'true' if there is a path that consumes both strings in their entirety in a matched way.

Parameters:
caseSensitive - is true if file names are case sensitive.
sourceMatch - is the source string (w/o wildcards)
sourceIndex - is the current point in the source string.
match - is the match string (w/wildcards)
matchIndex - is the current point in the match string.
Returns:
true if there is a match.
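
The recursive matching described for checkMatch and processCheck can be sketched as follows. The wildcard syntax is not given on this page; the sketch assumes '*' matches any run of characters (including none) and '?' matches exactly one character. It also assumes checkMatch is case sensitive, since its signature carries no such flag. The class name is hypothetical.

```java
final class WildcardMatchSketch {
    /** Entry point: match the source from sourceIndex against the whole wildcard string. */
    static boolean checkMatch(String sourceMatch, int sourceIndex, String match) {
        return processCheck(true, sourceMatch, sourceIndex, match, 0);
    }

    /** Returns true if some path consumes both strings in their entirety. */
    static boolean processCheck(boolean caseSensitive, String sourceMatch, int sourceIndex,
                                String match, int matchIndex) {
        if (matchIndex == match.length()) {
            // Pattern exhausted: succeed only if the source is exhausted too.
            return sourceIndex == sourceMatch.length();
        }
        char m = match.charAt(matchIndex);
        if (m == '*') {
            // Try letting '*' absorb zero, one, two, ... source characters.
            for (int i = sourceIndex; i <= sourceMatch.length(); i++) {
                if (processCheck(caseSensitive, sourceMatch, i, match, matchIndex + 1)) {
                    return true;
                }
            }
            return false;
        }
        if (sourceIndex == sourceMatch.length()) {
            return false;                   // pattern still needs a character
        }
        char s = sourceMatch.charAt(sourceIndex);
        boolean same = m == '?'
            || (caseSensitive ? s == m : Character.toLowerCase(s) == Character.toLowerCase(m));
        return same && processCheck(caseSensitive, sourceMatch, sourceIndex + 1, match, matchIndex + 1);
    }
}
```

The backtracking on '*' is exponential in the worst case, which is usually acceptable for short file names and patterns.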