BeautifulSoup Parser

BeautifulSoup is a Python package that parses broken HTML. While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is much more forgiving and has superior support for encoding detection.

lxml can benefit from the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module. It provides three main functions: fromstring() and parse(), which parse a string or file using BeautifulSoup, and convert_tree(), which converts an existing BeautifulSoup tree into a list of top-level Elements.

The functions fromstring() and parse() behave like their ElementTree counterparts: the former returns a root Element, the latter an ElementTree.
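To make the difference concrete, here is a small sketch (assuming both lxml and the bs4 package are installed; the input string and variable names are made up for the example):

```python
from io import BytesIO

from bs4 import BeautifulSoup
from lxml.html.soupparser import convert_tree, fromstring, parse

broken = b"<html><body><p>Hi</p>"

# fromstring() hands back the root Element directly
root = fromstring(broken)
print(root.tag)                    # html

# parse() wraps the same root in an ElementTree
tree = parse(BytesIO(broken))
print(tree.getroot().tag)          # html

# convert_tree() takes a soup object you built yourself and returns a
# plain Python list of Elements converted from it
soup = BeautifulSoup(broken, "html.parser")
elements = convert_tree(soup)
print(type(elements))
```

Note that parse() accepts either a file name or a file-like object, which is why the byte string is wrapped in a BytesIO here.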

Here is a document full of tag soup, similar to, but not quite like, HTML:

>>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'

To turn it into a tree, all you need to do is pass it to the fromstring() function:

>>> from lxml.html.soupparser import fromstring
>>> root = fromstring(tag_soup)

To see what we have here, you can serialise it:

>>> from lxml.etree import tostring
>>> print(tostring(root, pretty_print=True).decode())
<html>
  <meta/>
  <head>
    <title>Hello</title>
  </head>
  <body onload="crash()">Hi all<p/></body>
</html>

Not quite what you'd expect from an HTML page, but, well, it was broken already, right? BeautifulSoup did its best, and so now it's a tree.
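The tag soup above was already a text string. For byte input, BeautifulSoup's encoding detection takes over, and any extra keyword arguments to fromstring() are passed through to the BeautifulSoup constructor, so bs4 options such as from_encoding can be supplied as a hint. A small sketch, assuming the bs4 package is installed (the data string is invented for the example):

```python
from lxml.html.soupparser import fromstring

# Latin-1 encoded bytes without any encoding declaration in the markup.
data = "<html><body><p>Caf\xe9</p></body></html>".encode("latin-1")

# Keyword arguments that fromstring() does not use itself are forwarded
# to BeautifulSoup, so we can pass its from_encoding hint here.
root = fromstring(data, from_encoding="latin-1")
print(root.findtext(".//body/p"))   # Café
```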

To control which Element implementation is used, you can pass a makeelement factory function to parse() and fromstring(). By default, this is based on the HTML parser defined in lxml.html.
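As an illustrative sketch of swapping the factory (using a parser's makeelement method is just one way to obtain such a function), passing the factory of a plain XML parser yields ordinary lxml.etree Elements instead of the lxml.html ones:

```python
from lxml import etree
from lxml.html.soupparser import fromstring

tag_soup = "<meta><head><title>Hello</head<body>Hi all<p>"

# Default: Elements come from the HTML parser in lxml.html and offer
# its HTML-specific element API.
html_root = fromstring(tag_soup)
print(type(html_root).__name__)      # HtmlElement

# With the makeelement factory of a plain XMLParser, the same soup is
# built from ordinary lxml.etree Elements instead.
xml_parser = etree.XMLParser()
xml_root = fromstring(tag_soup, makeelement=xml_parser.makeelement)
print(type(xml_root).__name__)       # _Element
```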

By default, the BeautifulSoup parser also replaces the entities it finds with their character equivalents:

>>> tag_soup = '<body>&copy;&euro;&#45;&#245;&#445;<p>'
>>> body = fromstring(tag_soup).find('.//body')
>>> body.text
'©€-õƽ'

If you want the escapes back on the way out, you can serialise with the 'html' method, which always escapes these characters for safety reasons:

>>> tostring(body, method="html")
b'<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

>>> tostring(body, method="html", encoding="utf-8")
b'<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

>>> tostring(body, method="html", encoding='unicode')
'<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

When serialising to XML, on the other hand, non-ASCII characters are only escaped if the target encoding cannot represent them, as with the default plain ASCII:

>>> tostring(body)
b'<body>&#169;&#8364;-&#245;&#445;<p/></body>'

>>> tostring(body, encoding="utf-8")
b'<body>\xc2\xa9\xe2\x82\xac-\xc3\xb5\xc6\xbd<p/></body>'

>>> tostring(body, encoding='unicode')
'<body>©€-õƽ<p/></body>'

There is also a legacy module called lxml.html.ElementSoup, which mimics the interface of the ElementSoup module from the original ElementTree package.