org.cyberneko.html
public class HTMLScanner extends Object implements XMLDocumentScanner, XMLLocator, HTMLComponent
This component recognizes the following features:
This component recognizes the following properties:
Version: $Id: HTMLScanner.java,v 1.19 2005/06/14 05:52:37 andyc Exp $
See Also: HTMLElements
Nested Class Summary | |
---|---|
class | HTMLScanner.ContentScanner The primary HTML document scanner. |
static class | HTMLScanner.CurrentEntity Current entity. |
protected static class | HTMLScanner.LocationItem Location infoset item. |
static class | HTMLScanner.PlaybackInputStream A playback input stream. |
interface | HTMLScanner.Scanner Basic scanner interface. |
class | HTMLScanner.SpecialScanner Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. |
Field Summary | |
---|---|
protected static String | AUGMENTATIONS Include infoset augmentations. |
static String | CDATA_SECTIONS Scan CDATA sections. |
protected static boolean | DEBUG_CALLBACKS Set to true to debug callbacks. |
protected static int | DEFAULT_BUFFER_SIZE Default buffer size. |
protected static String | DEFAULT_ENCODING Default encoding. |
protected static String | DOCTYPE_PUBID Doctype declaration public identifier. |
protected static String | DOCTYPE_SYSID Doctype declaration system identifier. |
protected static String | ERROR_REPORTER Error reporter. |
protected boolean | fAugmentations Augmentations. |
protected int | fBeginColumnNumber Beginning column number. |
protected int | fBeginLineNumber Beginning line number. |
protected HTMLScanner.PlaybackInputStream | fByteStream The playback byte stream. |
protected boolean | fCDATASections CDATA sections. |
protected HTMLScanner.Scanner | fContentScanner Content scanner. |
protected HTMLScanner.CurrentEntity | fCurrentEntity Current entity. |
protected Stack | fCurrentEntityStack The current entity stack. |
protected String | fDefaultIANAEncoding Default encoding. |
protected String | fDoctypePubid Doctype declaration public identifier. |
protected String | fDoctypeSysid Doctype declaration system identifier. |
protected XMLDocumentHandler | fDocumentHandler The document handler. |
protected int | fElementCount Element count. |
protected int | fElementDepth Element depth. |
protected int | fEndColumnNumber Ending column number. |
protected int | fEndLineNumber Ending line number. |
protected HTMLErrorReporter | fErrorReporter Error reporter. |
protected boolean | fFixWindowsCharRefs Fix Microsoft Windows® character entity references. |
protected String | fIANAEncoding Auto-detected IANA encoding. |
protected boolean | fIgnoreSpecifiedCharset Ignore specified character set. |
protected boolean | fInsertDoctype Insert document type declaration. |
protected boolean | fIso8859Encoding True if the encoding matches "ISO-8859-*". |
protected String | fJavaEncoding Auto-detected Java encoding. |
protected short | fNamesAttrs Modify HTML attribute names. |
protected short | fNamesElems Modify HTML element names. |
protected boolean | fNotifyCharRefs Notify character entity references. |
protected boolean | fNotifyHtmlBuiltinRefs Notify HTML built-in general entity references. |
protected boolean | fNotifyXmlBuiltinRefs Notify XML built-in general entity references. |
protected boolean | fOverrideDoctype Override doctype declaration public and system identifiers. |
protected boolean | fReportErrors Report errors. |
protected HTMLScanner.Scanner | fScanner The current scanner. |
protected short | fScannerState The current scanner state. |
protected boolean | fScriptStripCDATADelims Strip CDATA delimiters from SCRIPT tags. |
protected boolean | fScriptStripCommentDelims Strip comment delimiters from SCRIPT tags. |
protected HTMLScanner.SpecialScanner | fSpecialScanner Special scanner used for elements whose content needs to be scanned as plain text, ignoring markup such as elements and entity references. |
protected XMLString | fString String. |
protected XMLStringBuffer | fStringBuffer String buffer. |
protected boolean | fStyleStripCDATADelims Strip CDATA delimiters from STYLE tags. |
protected boolean | fStyleStripCommentDelims Strip comment delimiters from STYLE tags. |
static String | FIX_MSWINDOWS_REFS Fix Microsoft Windows® character entity references. |
static String | HTML_4_01_FRAMESET_PUBID HTML 4.01 frameset public identifier ("-//W3C//DTD HTML 4.01 Frameset//EN"). |
static String | HTML_4_01_FRAMESET_SYSID HTML 4.01 frameset system identifier ("http://www.w3.org/TR/html4/frameset.dtd"). |
static String | HTML_4_01_STRICT_PUBID HTML 4.01 strict public identifier ("-//W3C//DTD HTML 4.01//EN"). |
static String | HTML_4_01_STRICT_SYSID HTML 4.01 strict system identifier ("http://www.w3.org/TR/html4/strict.dtd"). |
static String | HTML_4_01_TRANSITIONAL_PUBID HTML 4.01 transitional public identifier ("-//W3C//DTD HTML 4.01 Transitional//EN"). |
static String | HTML_4_01_TRANSITIONAL_SYSID HTML 4.01 transitional system identifier ("http://www.w3.org/TR/html4/loose.dtd"). |
static String | IGNORE_SPECIFIED_CHARSET Ignore the charset specified in the <meta http-equiv='Content-Type' content='text/html;charset=…'> tag. |
static String | INSERT_DOCTYPE Insert document type declaration. |
protected static String | NAMES_ATTRS Modify HTML attribute names: { "upper", "lower", "default" }. |
protected static String | NAMES_ELEMS Modify HTML element names: { "upper", "lower", "default" }. |
protected static short | NAMES_LOWERCASE Lowercase HTML names. |
protected static short | NAMES_NO_CHANGE Don't modify HTML names. |
protected static short | NAMES_UPPERCASE Uppercase HTML names. |
static String | NOTIFY_CHAR_REFS Notify character entity references. |
static String | NOTIFY_HTML_BUILTIN_REFS Notify handler of HTML built-in general entity references. |
static String | NOTIFY_XML_BUILTIN_REFS Notify handler of XML built-in general entity references. |
static String | OVERRIDE_DOCTYPE Override doctype declaration public and system identifiers. |
protected static String | REPORT_ERRORS Report errors. |
static String | SCRIPT_STRIP_CDATA_DELIMS Strip XHTML CDATA delimiters from SCRIPT tag contents. |
static String | SCRIPT_STRIP_COMMENT_DELIMS Strip HTML comment delimiters from SCRIPT tag contents. |
protected static short | STATE_CONTENT State: content. |
protected static short | STATE_END_DOCUMENT State: end document. |
protected static short | STATE_MARKUP_BRACKET State: markup bracket. |
protected static short | STATE_START_DOCUMENT State: start document. |
static String | STYLE_STRIP_CDATA_DELIMS Strip XHTML CDATA delimiters from STYLE tag contents. |
static String | STYLE_STRIP_COMMENT_DELIMS Strip HTML comment delimiters from STYLE tag contents. |
protected static HTMLEventInfo | SYNTHESIZED_ITEM Synthesized event info item. |
Method Summary | |
---|---|
protected static boolean | builtinXmlRef(String name) Returns true if the name is a built-in XML general entity reference. |
void | cleanup(boolean closeall) Cleans up used resources. |
static String | expandSystemId(String systemId, String baseSystemId) Expands a system id and returns the system id as a URI, if it can be expanded. |
protected static String | fixURI(String str) Fixes a platform dependent filename to standard URI form. |
protected int | fixWindowsCharacter(int origChar) Fixes Microsoft Windows® specific characters. |
String | getBaseSystemId() Returns the base system identifier. |
int | getCharacterOffset() Returns the current character offset. |
int | getColumnNumber() Returns the current column number. |
XMLDocumentHandler | getDocumentHandler() Returns the document handler. |
String | getEncoding() Returns the encoding. |
String | getExpandedSystemId() Returns the expanded system identifier. |
Boolean | getFeatureDefault(String featureId) Returns the default state for a feature. |
int | getLineNumber() Returns the current line number. |
String | getLiteralSystemId() Returns the literal system identifier. |
protected static short | getNamesValue(String value) Converts HTML names string value to constant value. |
Object | getPropertyDefault(String propertyId) Returns the default state for a property. |
String | getPublicId() Returns the public identifier. |
String[] | getRecognizedFeatures() Returns recognized features. |
String[] | getRecognizedProperties() Returns recognized properties. |
protected static String | getValue(XMLAttributes attrs, String aname) Returns the value of the specified attribute, ignoring case. |
String | getXMLVersion() Returns the xml version. |
protected int | load(int offset) Loads a new chunk of data into the buffer and returns the number of characters loaded or -1 if no additional characters were loaded. |
protected Augmentations | locationAugs() Returns an augmentations object with a location item added. |
protected static String | modifyName(String name, short mode) Modifies the given name based on the specified mode. |
void | pushInputSource(XMLInputSource inputSource) Pushes an input source onto the current entity stack. |
protected int | read() Reads a single character. |
void | reset(XMLComponentManager manager) Resets the component. |
protected XMLResourceIdentifier | resourceId() Returns an empty resource identifier. |
protected void | scanDoctype() Scans a DOCTYPE line. |
boolean | scanDocument(boolean complete) Scans the document. |
protected int | scanEntityRef(XMLStringBuffer str, boolean content) Scans an entity reference. |
protected String | scanLiteral() Scans a quoted literal. |
protected String | scanName() Scans a name. |
void | setDocumentHandler(XMLDocumentHandler handler) Sets the document handler. |
void | setFeature(String featureId, boolean state) Sets a feature. |
void | setInputSource(XMLInputSource source) Sets the input source. |
void | setProperty(String propertyId, Object value) Sets a property. |
protected void | setScanner(HTMLScanner.Scanner scanner) Sets the scanner. |
protected void | setScannerState(short state) Sets the scanner state. |
protected boolean | skip(String s, boolean caseSensitive) Returns true if the specified text is present and is skipped. |
protected boolean | skipMarkup(boolean balance) Skips markup. |
protected int | skipNewlines() Skips newlines and returns the number of newlines skipped. |
protected int | skipNewlines(int maxlines) Skips newlines and returns the number of newlines skipped. |
protected boolean | skipSpaces() Skips whitespace. |
protected Augmentations | synthesizedAugs() Returns an augmentations object with a synthesized item added. |
Note: This includes the five pre-defined XML general entities.
Note: This only applies to the five pre-defined XML general entities. Specifically, "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature.
To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true.
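The check described by `builtinXmlRef(String name)` can be illustrated with a small membership test over the five names listed in the note above. This is only a sketch: the class `XmlBuiltinRefsSketch` and the method `isBuiltinXmlRef` are hypothetical names chosen for this example, and the code is not taken from NekoHTML.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not NekoHTML's implementation.
final class XmlBuiltinRefsSketch {
    // The five pre-defined XML general entities named in the note above.
    private static final Set<String> XML_BUILTINS =
            new HashSet<>(Arrays.asList("amp", "lt", "gt", "quot", "apos"));

    /** Returns true if the name is one of the five pre-defined XML entities. */
    static boolean isBuiltinXmlRef(String name) {
        return name != null && XML_BUILTINS.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(isBuiltinXmlRef("amp"));
        System.out.println(isBuiltinXmlRef("nbsp"));
    }
}
```

An HTML-only entity name such as "nbsp" is not in the set, which is why the HTML and XML notifications are controlled by separate features.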
Parameters: closeall If true, closes all streams, including the original. Pass false when the application opened the original document stream and is responsible for closing it.
Parameters: systemId The systemId to be expanded.
Returns: Returns the URI string representing the expanded system identifier. A null value indicates that the given system identifier is already expanded.
Parameters: str The string to fix.
Returns: Returns the fixed URI string.
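As a rough illustration of what "fixing a platform dependent filename" can involve, the sketch below normalizes path separators and prefixes Windows drive paths with a slash. It is an assumption about the general technique, not the body of `fixURI`. The class name `FixUriSketch` and the extra `separator` parameter are hypothetical; the parameter exists only so the example behaves the same on every platform.

```java
// Hypothetical illustration; not NekoHTML's implementation.
final class FixUriSketch {
    /**
     * Converts a platform-dependent filename to a slash-separated form.
     * The separator is passed in (an assumption of this sketch) so the
     * result does not depend on the machine running the example.
     */
    static String fixUri(String str, char separator) {
        // Use '/' as the only path separator.
        String fixed = str.replace(separator, '/');
        // A Windows drive path such as "C:/docs" gets a leading slash.
        if (fixed.length() >= 2
                && fixed.charAt(1) == ':'
                && Character.isLetter(fixed.charAt(0))) {
            fixed = "/" + fixed;
        }
        return fixed;
    }

    public static void main(String[] args) {
        System.out.println(fixUri("C:\\docs\\index.html", '\\'));
        System.out.println(fixUri("/home/user/index.html", '/'));
    }
}
```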
Details about this common problem can be found at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
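The problem arises when a page contains numeric character references such as `&#146;` that use Windows-1252 code points instead of Unicode ones. A minimal sketch of the kind of remapping involved is shown below. The table covers only a few well-known Windows-1252 code points and is not NekoHTML's own mapping, which may differ; the class name `WindowsCharRefSketch` is hypothetical.

```java
// Hypothetical illustration; the mapping NekoHTML applies may differ.
final class WindowsCharRefSketch {
    /** Maps a few Windows-1252 code points to their Unicode equivalents. */
    static int fixWindowsCharacter(int origChar) {
        switch (origChar) {
            case 0x80: return 0x20AC; // euro sign
            case 0x85: return 0x2026; // horizontal ellipsis
            case 0x91: return 0x2018; // left single quotation mark
            case 0x92: return 0x2019; // right single quotation mark
            case 0x93: return 0x201C; // left double quotation mark
            case 0x94: return 0x201D; // right double quotation mark
            case 0x96: return 0x2013; // en dash
            case 0x97: return 0x2014; // em dash
            case 0x99: return 0x2122; // trade mark sign
            default:   return origChar; // everything else passes through
        }
    }

    public static void main(String[] args) {
        // &#146; is a common Windows-specific apostrophe reference.
        System.out.println(Integer.toHexString(fixWindowsCharacter(146)));
    }
}
```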
See Also: NAMES_NO_CHANGE NAMES_LOWERCASE NAMES_UPPERCASE
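The relationship between the property strings ("upper", "lower", "default"), the `NAMES_*` constants, and `modifyName(String name, short mode)` can be sketched as follows. The numeric constant values, the class name `NamesModeSketch`, and the choice to treat any unrecognized string as "no change" are assumptions made for this example, not details confirmed by the documentation.

```java
import java.util.Locale;

// Hypothetical illustration; not NekoHTML's implementation.
final class NamesModeSketch {
    // Numeric values are assumptions made for this example.
    static final short NAMES_NO_CHANGE = 0;
    static final short NAMES_UPPERCASE = 1;
    static final short NAMES_LOWERCASE = 2;

    /** Converts a names property string to one of the constants above. */
    static short getNamesValue(String value) {
        if ("lower".equals(value)) return NAMES_LOWERCASE;
        if ("upper".equals(value)) return NAMES_UPPERCASE;
        return NAMES_NO_CHANGE; // "default" and anything unrecognized
    }

    /** Modifies the given name based on the specified mode. */
    static String modifyName(String name, short mode) {
        switch (mode) {
            case NAMES_UPPERCASE: return name.toUpperCase(Locale.ROOT);
            case NAMES_LOWERCASE: return name.toLowerCase(Locale.ROOT);
            default:              return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Body", getNamesValue("upper")));
        System.out.println(modifyName("onClick", getNamesValue("lower")));
    }
}
```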
Parameters: offset The offset at which new characters should be loaded.
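The contract of `load(int offset)` (fill the buffer starting at `offset`, return the number of characters loaded, or -1 when nothing more is available) can be sketched with a plain `java.io.Reader`. The class `BufferLoadSketch`, its fields, and the wrapping of `IOException` in `UncheckedIOException` are assumptions of this example rather than details of the real scanner.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; not NekoHTML's implementation.
final class BufferLoadSketch {
    private final Reader reader;
    final char[] buffer;
    int length; // number of valid characters currently in the buffer

    BufferLoadSketch(Reader reader, int bufferSize) {
        this.reader = reader;
        this.buffer = new char[bufferSize];
    }

    /** Loads characters at the given offset; returns the count or -1. */
    int load(int offset) {
        try {
            int count = reader.read(buffer, offset, buffer.length - offset);
            if (count <= 0) {
                length = offset; // nothing new was loaded
                return -1;
            }
            length = offset + count;
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferLoadSketch s = new BufferLoadSketch(new StringReader("abcde"), 4);
        int n;
        while ((n = s.load(0)) != -1) {
            System.out.println(new String(s.buffer, 0, n));
        }
    }
}
```

Passing a non-zero offset keeps the characters before it in place, which is useful when part of a token is still pending in the buffer.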
Note: This functionality is experimental at this time and is subject to change in future releases of NekoHTML.
Parameters: inputSource The new input source to start scanning.
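Conceptually, pushing an input source suspends the entity being scanned and resumes it once the pushed source is exhausted. The sketch below models that idea with a stack of `java.io.Reader` objects. It is a simplified assumption about the behavior, not the scanner's real code; the class `EntityStackSketch` and its use of `Reader` in place of `XMLInputSource` are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration; not NekoHTML's implementation.
final class EntityStackSketch {
    private final Deque<Reader> suspended = new ArrayDeque<>();
    private Reader current;

    EntityStackSketch(Reader document) {
        this.current = document;
    }

    /** Suspends the current source and starts reading from the new one. */
    void pushInputSource(Reader source) {
        suspended.push(current);
        current = source;
    }

    /** Reads one character, resuming suspended sources as needed. */
    int read() {
        try {
            while (true) {
                int c = current.read();
                if (c != -1) return c;
                if (suspended.isEmpty()) return -1; // original is exhausted
                current.close();                    // pushed source is done
                current = suspended.pop();          // resume the previous one
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        EntityStackSketch s = new EntityStackSketch(new StringReader("ab"));
        StringBuilder out = new StringBuilder();
        out.append((char) s.read());
        s.pushInputSource(new StringReader("XY"));
        for (int c = s.read(); c != -1; c = s.read()) out.append((char) c);
        System.out.println(out);
    }
}
```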