Class LexML

  • Direct Known Subclasses:
    LexHTML

    public class LexML
    extends java.lang.Object
    This class breaks angle-bracket-separated markup languages like SGML, XML, and HTML into tokens. It understands three types of tokens:
    tags
    Formally known as "entities", tags are delimited by "<" and ">". The first word in the tag is the tag name and the rest of the tag consists of the attributes, a set of "name=value" or "name" data. Spaces in tags are not significant except for quoted values in the attributes.
    string
    Plain strings that are not in angle-brackets. Spaces are significant and preserved.
    comments
    Delimited by "<!--" and "-->". All text between the delimiters is part of the comment. By convention, however, some comments carry data, so the methods that extract fields from tags can also be applied to comments. Spaces are significant and preserved in a comment, unless the comment is treated as a tag, in which case the tag rules apply.
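
    The three token types can be illustrated with a minimal, self-contained sketch. TokenSketch, Kind, and Token are hypothetical names invented for this illustration; the code is not the LexML implementation and only mimics the delimiter rules described above.

```java
import java.util.ArrayList;
import java.util.List;

class TokenSketch {
    // Hypothetical stand-ins for the three token types described above.
    enum Kind { TAG, STRING, COMMENT }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    // Splits s into comment, tag, and string tokens using simple delimiter searches.
    static List<Token> tokenize(String s) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int stop;
            Kind kind;
            if (s.startsWith("<!--", i)) {
                int end = s.indexOf("-->", i + 4);
                stop = (end < 0) ? s.length() : end + 3;   // unterminated: take the rest
                kind = Kind.COMMENT;
            } else if (s.charAt(i) == '<') {
                int end = s.indexOf('>', i + 1);
                stop = (end < 0) ? s.length() : end + 1;   // unterminated: take the rest
                kind = Kind.TAG;
            } else {
                int end = s.indexOf('<', i);
                stop = (end < 0) ? s.length() : end;       // text runs up to the next '<'
                kind = Kind.STRING;
            }
            out.add(new Token(kind, s.substring(i, stop)));
            i = stop;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("<p>Hello <!-- note --> world</p>")) {
            System.out.println(t.kind + " [" + t.text + "]");
        }
    }
}
```

    Spaces inside the string tokens are kept exactly as they appear in the input, matching the rule that spaces are significant in strings.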

    This class is intended to parse markup languages, not to validate them. "Malformed" data is interpreted as graciously as possible, in order to extract as much information as possible. For instance: spaces are allowed between the "<" and the tag name, values in tags do not need to be quoted, and unbalanced quotes are accepted.

    One type of "malformed" data specifically not handled is a quoted ">" character occurring within the body of a tag. Even if it is quoted, a ">" in the attributes of a tag will be interpreted as the end of the tag. For example, the single tag <img src='foo.jpg' alt='xyz > abc'> will be erroneously broken by this parser into two tokens:

    • the tag <img src='foo.jpg' alt='xyz >
    • the string "abc'>" (and possibly whatever text follows after).
    Unfortunately, this type of "malformed" data is known to occur regularly.
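
    The split described above can be reproduced with a plain search for the first ">". This is a sketch of the assumed behavior, not the parser's actual code; NaiveCloseDemo and its methods are hypothetical names.

```java
class NaiveCloseDemo {
    // Assumed behavior: the tag ends at the first '>' found, whether or not it is quoted.
    static int naiveClose(String s, int start) {
        return s.indexOf('>', start);
    }

    // Returns {tag, remainder} for an input that begins with '<'.
    static String[] split(String s) {
        int close = naiveClose(s, 1);
        return new String[] { s.substring(0, close + 1), s.substring(close + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("<img src='foo.jpg' alt='xyz > abc'>");
        System.out.println("tag:    " + parts[0]);
        System.out.println("string: " + parts[1]);
    }
}
```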

    This class also may not properly parse all well-formed XML tags, such as tags with extended paired delimiters <& and &>, <? and ?>, or <![CDATA[ and ]]>. Additionally, XML tags that have embedded comments containing the ">" character will not be parsed correctly (for example: <!DOCTYPE foo SYSTEM -- a > b -- foo.dtd>), since the ">" in the comment will be interpreted as the end of the declaration tag, for the same reason mentioned above.
    Note: this behavior may be changed on a per-application basis by overriding the findClose method in a subclass.
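
    One way a subclass might approach this is a quote-aware scan. The sketch below is standalone so that it can be run without the library; QuoteAwareClose and findCloseQuoteAware are hypothetical names, and returning -1 when no ">" is found is an assumption, since the return value in that case is not documented here.

```java
class QuoteAwareClose {
    // Returns the index of the first '>' at or after start that is not inside a
    // single- or double-quoted value, or -1 if there is none (assumed convention).
    static int findCloseQuoteAware(String str, int start) {
        char quote = 0;                       // 0 means "not inside a quoted value"
        for (int i = start; i < str.length(); i++) {
            char c = str.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;                // the matching quote ends the value
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                    // remember which quote opened the value
            } else if (c == '>') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<img src='foo.jpg' alt='xyz > abc'>";
        System.out.println(findCloseQuoteAware(html, 1));
    }
}
```

    An overriding findClose(int start) could run a scan like this over the protected str field. The trade-off is that an unbalanced quote then hides every later ">", which works against the lenient handling of unbalanced quotes described earlier.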

    Version:
    2.7
    Author:
    Colin Stevens (colin.stevens@sun.com)
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int COMMENT
      The value returned by getType for comment tokens.
      protected java.lang.String str
      static int STRING
      The value returned by getType for string tokens.
      static int TAG
      The value returned by getType for tag tokens.
    • Constructor Summary

      Constructors 
      Constructor Description
      LexML​(java.lang.String str)
      Create a new ML parser, which can be used to iterate over the tokens in the given string.
    • Method Summary

      Modifier and Type Method Description
      protected int findClose​(int start)
      Finds the ">" that closes the current tag.
      This may be overridden by subclasses to allow more sophisticated behavior.
      java.lang.String getArgs()
      Gets the name/value pairs in the body of the current tag as a string.
      StringMap getAttributes()
      Gets the name/value pairs in the body of the current tag as a table.
      java.lang.String getBody()
      Gets the string making up the current token, not including the angle brackets or comment delimiters, if appropriate.
      int getLocation()
      Return the current processing location.
      java.lang.String getString()
      Returns the string that is currently being processed.
      java.lang.String getTag()
      Gets the tag name at the beginning of the current tag.
      java.lang.String getToken()
      Gets the string making up the whole current token, including the brackets or comment delimiters, if appropriate.
      int getType()
      Gets the type of the current token.
      boolean isSingleton()
      A tag is a "singleton" if the closing ">" is preceded by a slash (/).
      boolean nextToken()
      Advances to the next token.
      void replace​(java.lang.String str)
      Changes the string that this LexML is parsing.
      java.lang.String rest()
      Gets the rest of the string that has not yet been parsed.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • COMMENT

        public static final int COMMENT
        The value returned by getType for comment tokens.
        See Also:
        Constant Field Values
      • TAG

        public static final int TAG
        The value returned by getType for tag tokens.
        See Also:
        Constant Field Values
      • STRING

        public static final int STRING
        The value returned by getType for string tokens.
        See Also:
        Constant Field Values
      • str

        protected java.lang.String str
    • Constructor Detail

      • LexML

        public LexML​(java.lang.String str)
        Create a new ML parser, which can be used to iterate over the tokens in the given string.
        Parameters:
        str - The ML to parse.
    • Method Detail

      • nextToken

        public boolean nextToken()
        Advances to the next token. The user can then call the other methods in this class to get information about the new current token.
        Returns:
        true if a token was found, false if there were no more tokens left.
      • findClose

        protected int findClose​(int start)
        Finds the ">" that closes the current tag.
        This may be overridden by subclasses to allow more sophisticated behavior.
        Parameters:
        start - The starting index in str to look for the matching >
        Returns:
        The index of str that contains the matching >
      • getType

        public int getType()
        Gets the type of the current token.
        Returns:
        The type.
        See Also:
        COMMENT, TAG, STRING
      • isSingleton

        public boolean isSingleton()
        A tag is a "singleton" if the closing ">" is preceded by a slash (/), e.g. <br/>.
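
        The rule can be sketched as a check on the last two characters of the token. SingletonCheck is a hypothetical name, and the sketch assumes the slash must immediately precede the ">"; whether the real method tolerates whitespace between them is not stated here.

```java
class SingletonCheck {
    // token is the whole tag, including the angle brackets.
    static boolean isSingleton(String token) {
        int n = token.length();
        return n >= 3
                && token.charAt(0) == '<'
                && token.charAt(n - 1) == '>'
                && token.charAt(n - 2) == '/';
    }

    public static void main(String[] args) {
        System.out.println(isSingleton("<br/>"));
        System.out.println(isSingleton("<br>"));
    }
}
```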
      • getToken

        public java.lang.String getToken()
        Gets the string making up the whole current token, including the brackets or comment delimiters, if appropriate.
        Returns:
        The current token.
      • getBody

        public java.lang.String getBody()
        Gets the string making up the current token, not including the angle brackets or comment delimiters, if appropriate.
        Returns:
        The body of the token.
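
        The difference between getToken and getBody amounts to stripping the delimiters. The following is a minimal sketch of that relationship using a hypothetical helper, BodySketch.body, which takes the whole token as input; it is not the library's code.

```java
class BodySketch {
    // Strips comment delimiters or angle brackets from a whole token;
    // plain strings are returned unchanged.
    static String body(String token) {
        if (token.length() >= 7 && token.startsWith("<!--") && token.endsWith("-->")) {
            return token.substring(4, token.length() - 3);
        }
        if (token.length() >= 2 && token.startsWith("<") && token.endsWith(">")) {
            return token.substring(1, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println("[" + body("<table border=3>") + "]");
        System.out.println("[" + body("<!-- note -->") + "]");
        System.out.println("[" + body("plain text") + "]");
    }
}
```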
      • getString

        public java.lang.String getString()
        Returns the string that is currently being processed.
      • getLocation

        public int getLocation()
        Return the current processing location.
        Returns:
        The character index of the current tag.
      • getTag

        public java.lang.String getTag()
        Gets the tag name at the beginning of the current tag. In other words, the tag name for <table border=3> is "table". Any surrounding space characters are removed, but the case of the tag is preserved.

        For comments, the "tag" is the first word in the comment. This can be used to help parse comments that are structured similarly to regular tags, such as server-side include comments like <!--#include virtual="file.inc"-->. The tag in this case would be "!--#include".

        Returns:
        The tag name, or null if the current token was a string.
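
        The extraction rule can be sketched as "drop the outer angle brackets, trim, and take the first word". TagNameSketch.tagName is a hypothetical helper that works on the whole token text; it follows the description above and is not taken from the library.

```java
class TagNameSketch {
    // Returns the first word inside the angle brackets, or null for a plain string.
    static String tagName(String token) {
        if (!token.startsWith("<")) {
            return null;                               // plain strings have no tag name
        }
        int end = token.endsWith(">") ? token.length() - 1 : token.length();
        String inner = token.substring(1, end).trim(); // surrounding spaces are removed
        int i = 0;
        while (i < inner.length() && !Character.isWhitespace(inner.charAt(i))) {
            i++;
        }
        return inner.substring(0, i);                  // case is preserved
    }

    public static void main(String[] args) {
        System.out.println(tagName("<table border=3>"));
        System.out.println(tagName("<!--#include virtual=\"file.inc\"-->"));
    }
}
```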
      • getArgs

        public java.lang.String getArgs()
        Gets the name/value pairs in the body of the current tag as a string.
        Returns:
        The name/value pairs, or null if the current token was a string.
      • getAttributes

        public StringMap getAttributes()
        Gets the name/value pairs in the body of the current tag as a table.

        Any quote marks in the body, either single or double quotes, are left on the values, so that the values can be easily re-emitted and still form a valid body.

        For names that have no associated value in the tag, the value is stored as the empty string "". Therefore, the two tags <table border> and <table border=""> cannot be distinguished based on the result of calling getAttributes.

        Returns:
        The table of name/value pairs, or null if the current token was a string.
      • rest

        public java.lang.String rest()
        Gets the rest of the string that has not yet been parsed.

        Example use: to help the parser in circumstances such as the HTML "<script>" tag, where the script body doesn't obey the rules because it might contain lone "<" or ">" characters, which this parser would interpret as the start or end of funny-looking tags.

        Returns:
        The unparsed remainder of the string.
        See Also:
        replace(java.lang.String)
      • replace

        public void replace​(java.lang.String str)
        Changes the string that this LexML is parsing.

        Example use: the caller decided to parse part of the body, and now wants this LexML to pick up and parse the rest of it.

        Parameters:
        str - The string that this LexML should now parse. Whatever string this LexML was parsing is forgotten, and it now starts parsing at the beginning of the new string.
        See Also:
        rest()
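
        The rest and replace methods are meant to work together. A caller that has just seen a "<script>" tag could take the unparsed remainder, cut it at the closing script tag itself, and hand the tail back to the parser. The sketch below shows only the string handling; ScriptSkipSketch and its methods are hypothetical helpers, and the comments indicate where the LexML calls would go.

```java
class ScriptSkipSketch {
    // Case-insensitive search that keeps indices aligned with the original string.
    static int indexOfIgnoreCase(String s, String needle) {
        for (int i = 0; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns {scriptBody, remainder}. With a LexML instance named lex, the input
    // would come from lex.rest() and the remainder would go to lex.replace(...).
    static String[] splitAtScriptEnd(String rest) {
        int end = indexOfIgnoreCase(rest, "</script");
        if (end < 0) {
            return new String[] { rest, "" };   // unterminated script: all of it is body
        }
        return new String[] { rest.substring(0, end), rest.substring(end) };
    }

    public static void main(String[] args) {
        String[] parts = splitAtScriptEnd("if (a < b) go();</SCRIPT><p>done");
        System.out.println("body:      " + parts[0]);
        System.out.println("remainder: " + parts[1]);
    }
}
```

        Because the remainder starts at the closing script tag, the parser resumes on well-formed markup and never sees the lone "<" inside the script body.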