2.2.1. parsers – Generic Webpage Parsers for the browser Module

Defines a set of useful parser functions and generators for parser functions. A parser function must take three arguments:

source
    The raw, undecoded byte result from the page load. This should be the result of calling read() on the file-like handle that urllib's urlopen() returns.
headers
    The result of a call to info() on that same file handle.
url
    The URL of the current page (after any redirects).
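
For illustration, a minimal parser obeying this contract might look like the following (count_lines is a hypothetical example, not part of the module):

def count_lines(source, headers, url):
    # source is raw bytes, so count byte newlines.
    return source.count(b'\n')

line_total = lib.browser.load_page(url, parser=count_lines)
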
lib.browser.parsers.beautiful_soup_html(source, headers, url)

Returns a BeautifulSoup.BeautifulSoup object built with the BeautifulSoup library. BeautifulSoup is very error resistant and may be useful for rather broken HTML. Additionally, it is written in pure Python, unlike lxml, which has native dependencies. Unfortunately, it is rather slow compared to lxml, and it currently has poor Python 3 support.
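
For example, using BeautifulSoup 3's traversal API on the result:

soup = lib.browser.load_page(url, parser=lib.browser.parsers.beautiful_soup_html)
# Collect every link target on the page.
links = [a['href'] for a in soup.findAll('a', href=True)]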

lib.browser.parsers.beautiful_soup_xml(source, headers, url)

Returns a BeautifulSoup.BeautifulStoneSoup object built with the BeautifulSoup library. BeautifulSoup is very error resistant and may be useful for rather broken XML. Additionally, it is written in pure Python, unlike lxml, which has native dependencies. Unfortunately, it is rather slow compared to lxml, and it currently has poor Python 3 support.
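
Usage mirrors beautiful_soup_html(); for example, pulling text out of an XML feed:

soup = lib.browser.load_page(url, parser=lib.browser.parsers.beautiful_soup_xml)
# BeautifulStoneSoup treats the document as generic XML.
titles = [t.string for t in soup.findAll('title')]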

lib.browser.parsers.htmlparser(subclass, *args, **kwargs)

Takes a subclass of html.parser.HTMLParser (Python 3) or HTMLParser.HTMLParser (Python 2), or alternatively a factory returning a subclass of one of those, and builds a parser function. When given data, the returned parser will construct a new instance of the subclass to handle it.
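
For illustration, a hypothetical subclass might be used like this (LinkCollector is not part of the module, and it is assumed here that the built parser feeds the page source to the new instance and returns it):

from html.parser import HTMLParser  # HTMLParser.HTMLParser on Python 2

class LinkCollector(HTMLParser):
    """Illustrative subclass that gathers every href on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')

parse_links = lib.browser.parsers.htmlparser(LinkCollector)
collector = lib.browser.load_page(url, parser=parse_links)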

lib.browser.parsers.lxml_html(source, headers, url)

Returns an lxml.etree.ElementTree() generated with lxml’s html module.
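
For example, since lxml's ElementTree objects support xpath(), extracting every link target is one call:

tree = lib.browser.load_page(url, parser=lib.browser.parsers.lxml_html)
hrefs = tree.xpath('//a/@href')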

lib.browser.parsers.lxml_xml(source, headers, url)

Returns an lxml.etree.ElementTree() generated with lxml’s etree module.
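
Usage is analogous; for instance, walking an XML document through the standard ElementTree interface:

tree = lib.browser.load_page(url, parser=lib.browser.parsers.lxml_xml)
items = tree.getroot().findall('.//item')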

lib.browser.parsers.passthrough(source, headers, url)

Returns the byte data given by the read() method on the result from urlopen. This just returns the source argument that it's passed.

lib.browser.parsers.passthrough_args(source, headers, url)

Returns a tuple of the arguments it’s given. You can then use these arguments to call multiple other parsers. Here’s an example:

args = lib.browser.load_page(url, parser=passthrough_args)
lxml_data = lxml_html(*args)
str_data = passthrough_str(*args)

lib.browser.parsers.passthrough_str(byte_source, headers, url)

Like passthrough(), but returns a unicode str object rather than a sequence of bytes.

The encoding is determined automatically from the HTTP headers or, failing that, from the HTML source itself if it contains a meta http-equiv tag specifying one (assuming that tag can be found via a best-attempt UTF-8 decoding). If the encoding cannot be determined at all, the source is simply decoded as UTF-8, with unknown characters ignored.

Unfortunately, this function does not yet have a system like BeautifulSoup's UnicodeDammit or chardet, which can build a statistical model of the page's likely encoding.
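
A rough sketch of that detection order, using illustrative helper and variable names rather than the module's actual internals:

import re

def _guess_encoding(source, headers):
    # 1. Try the HTTP Content-Type header, e.g. "text/html; charset=iso-8859-1".
    content_type = headers.get('Content-Type') or ''
    if 'charset=' in content_type:
        return content_type.split('charset=')[-1].split(';')[0].strip()
    # 2. Best-attempt UTF-8 decode, then look for a charset near the top of the page.
    text = source.decode('utf-8', 'ignore')
    match = re.search(r'charset=([\w-]+)', text[:2048], re.I)
    if match:
        return match.group(1)
    # 3. Give up and fall back to UTF-8 (unknown characters are ignored later).
    return 'utf-8'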

lib.browser.parsers.passthrough_str_with_encoding(encoding)

Creates and returns a custom version of passthrough_str() that uses the specified string encoding rather than attempting automatic detection.
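
A minimal sketch of what such a generator plausibly looks like (the actual implementation may differ, e.g. in its error handling):

def passthrough_str_with_encoding(encoding):
    def parser(source, headers, url):
        # Decode with the caller-supplied encoding; no detection is attempted.
        return source.decode(encoding)
    return parser

# e.g. a parser that always decodes pages as Latin-1:
latin1 = lib.browser.parsers.passthrough_str_with_encoding('latin-1')
text = lib.browser.load_page(url, parser=latin1)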