org.htmlparser.parserapplications
public class SiteCapturer extends Object
Field Summary | |
---|---|
protected boolean | mCaptureResources If true, save resources locally too; otherwise, leave resource links pointing to the original page. |
protected HashSet | mCopied The set of resources already copied. |
protected NodeFilter | mFilter The filter to apply to the nodes retrieved. |
protected HashSet | mFinished The set of pages already captured. |
protected ArrayList | mImages The list of resources to copy. |
protected ArrayList | mPages The list of pages to capture. |
protected Parser | mParser The parser to use for processing. |
protected String | mSource The web site to capture. |
protected String | mTarget The local directory to capture to. |
protected int | TRANSFER_SIZE Copy buffer size. |
Constructor Summary | |
---|---|
SiteCapturer() | Create a web site capturer. |
Method Summary | |
---|---|
void | capture() Perform the capture. |
protected void | copy() Copy a resource (image) locally. |
protected String | decode(String raw) Unescape a URL to form a file name. |
boolean | getCaptureResources() Getter for property captureResources. |
NodeFilter | getFilter() Getter for property filter. |
String | getSource() Getter for property source. |
String | getTarget() Getter for property target. |
protected boolean | isHtml(String link) Returns true if the link contains text/html content. |
protected boolean | isToBeCaptured(String link) Returns true if the link is one we are interested in. |
static void | main(String[] args) Mainline to capture a web site locally. |
protected String | makeLocalLink(String link, String current) Converts a link to local. |
protected void | process(NodeFilter filter) Process a single page. |
void | setCaptureResources(boolean capture) Setter for property captureResources. |
void | setFilter(NodeFilter filter) Setter for property filter. |
void | setSource(String source) Setter for property source. |
void | setTarget(String target) Setter for property target. |
mCaptureResources
If true, save resources locally too; otherwise, leave resource links pointing to the original page.
decode(String raw)
Parameters: raw The escaped URI.
Returns: The native URI.
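The unescaping step that decode describes can be sketched as follows. This is a minimal illustration, not the library's implementation: the class name `DecodeSketch` is hypothetical, and it assumes UTF-8 percent-escapes. `java.net.URLDecoder` targets form encoding and would turn `+` into a space, so the sketch protects literal plus signs first.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of org.htmlparser.
class DecodeSketch {

    // Turns percent-escapes such as %20 back into their characters so the
    // result can be used as a local file name. Assumes UTF-8 escapes.
    static String decode(String raw) {
        try {
            // URLDecoder treats '+' as an encoded space (form encoding),
            // so escape literal '+' first to keep it unchanged.
            return URLDecoder.decode(raw.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("images/my%20logo.png"));
    }
}
```

A real capture would also have to handle characters that are legal in URLs but not in file names on the target platform; that is outside this sketch.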
getCaptureResources()
If true, the images and other resources referenced by the site and within the base URL tree are also copied locally to the target directory. If false, the image links are left 'as is', still referring to the original site.
Returns: Value of property captureResources.
getFilter()
Returns: Value of property filter.
getSource()
Returns: Value of property source.
getTarget()
Returns: Value of property target.
isHtml(String link)
Returns true if the link contains text/html content.
Parameters: link The URL to check for content type.
Returns: true if the HTTP header indicates the type is "text/html".
Throws: ParserException If the supplied URL can't be read from.
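One way to perform a content-type check of this kind is sketched below. The class `ContentTypeSketch` and its method names are hypothetical, and the sketch reports read failures as a plain `IOException` rather than the `ParserException` documented above. It relies only on the standard `URLConnection.getContentType()` call.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Locale;

// Hypothetical helper, not part of org.htmlparser.
class ContentTypeSketch {

    // True if a Content-Type header value denotes HTML, ignoring case and
    // any trailing parameters such as "; charset=UTF-8".
    static boolean isHtmlType(String contentType) {
        return contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    // Opens a connection to the link and inspects its Content-Type header.
    // Requires network access; getContentType() returns null if unknown.
    static boolean isHtml(String link) throws IOException {
        URLConnection connection = URI.create(link).toURL().openConnection();
        return isHtmlType(connection.getContentType());
    }

    public static void main(String[] args) {
        System.out.println(isHtmlType("text/html; charset=UTF-8"));
        System.out.println(isHtmlType("image/png"));
    }
}
```

Splitting the header test from the network call keeps the string logic checkable without a live server.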
isToBeCaptured(String link)
Returns true if the link is one we are interested in.
Parameters: link The link to be checked.
Returns: true if the link has the source URL as a prefix and doesn't contain '?' or '#'; the former because server-side queries can't be handled in the static target directory structure, and the latter because the full page with that reference has presumably already been captured. The comparison is case-insensitive, which is cheating really, but it's cheap.
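The rule described above (source URL as a case-insensitive prefix, no '?' and no '#') can be sketched as a small predicate. The class `LinkFilterSketch` is hypothetical, and the source URL is passed as an explicit parameter here rather than read from the `mSource` field.

```java
import java.util.Locale;

// Hypothetical helper, not part of org.htmlparser.
class LinkFilterSketch {

    // True if the link lies under the source URL (compared without regard
    // to case) and carries neither a query ('?') nor a fragment ('#').
    static boolean isToBeCaptured(String source, String link) {
        boolean underSource = link.toLowerCase(Locale.ROOT)
                .startsWith(source.toLowerCase(Locale.ROOT));
        return underSource
                && link.indexOf('?') == -1
                && link.indexOf('#') == -1;
    }

    public static void main(String[] args) {
        String source = "http://example.com/docs/";
        System.out.println(isToBeCaptured(source, "http://example.com/docs/index.html"));
        System.out.println(isToBeCaptured(source, "http://example.com/docs/search?q=x"));
    }
}
```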
main(String[] args)
Parameters: args The command line arguments. There are three arguments: the web site to capture, the local directory to save it to, and a flag (true or false) to indicate whether resources such as images and video are to be captured as well. These are requested via dialog boxes if not supplied.
Throws: MalformedURLException If the supplied URL is invalid.
Throws: IOException If an error occurs reading the page or resources.
makeLocalLink(String link, String current)
Parameters: link The link to make relative.
Parameters: current The current page URL, or empty if it's an absolute URL that needs to be converted.
Returns: The URL relative to the current page.
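A possible way to compute such a relative link is sketched below. It is an illustration under stated assumptions, not the library's algorithm: the class `LocalLinkSketch` is hypothetical, the source URL is passed explicitly and is assumed to end in '/', both URLs are assumed to lie under it, and an empty `current` is read as "relative to the capture root". The result is valid but not necessarily the shortest relative path.

```java
// Hypothetical helper, not part of org.htmlparser.
class LocalLinkSketch {

    // Rewrites an absolute link under the source URL as a path relative to
    // the directory of the current page: one "../" for each directory
    // between the capture root and the current page, then the link's path.
    static String makeLocalLink(String source, String link, String current) {
        if (!link.startsWith(source)) {
            throw new IllegalArgumentException("link is outside the source URL");
        }
        String linkPath = link.substring(source.length());
        if (current.isEmpty()) {
            return linkPath; // assumed: relative to the capture root
        }
        if (!current.startsWith(source)) {
            throw new IllegalArgumentException("current page is outside the source URL");
        }
        String currentPath = current.substring(source.length());
        StringBuilder relative = new StringBuilder();
        for (char c : currentPath.toCharArray()) {
            if (c == '/') {
                relative.append("../"); // climb out of one directory level
            }
        }
        return relative.append(linkPath).toString();
    }

    public static void main(String[] args) {
        System.out.println(makeLocalLink(
                "http://example.com/docs/",
                "http://example.com/docs/images/logo.png",
                "http://example.com/docs/guide/intro.html"));
    }
}
```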
process(NodeFilter filter)
Parameters: filter The filter to apply to the collected nodes.
Throws: ParserException If a parse error occurs.
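setCaptureResources(boolean capture)
Parameters: capture New value of property captureResources.
setFilter(NodeFilter filter)
Parameters: filter New value of property filter.
setSource(String source)
Parameters: source New value of property source.
setTarget(String target)
Parameters: target New value of property target.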
HTML Parser is an open source library released under LGPL.