Document Parsing

Interface

class html2ans.base.AbstractHtmlAnsParser[source]

Bases: object

The abstract base root/top-level parser class. Makes no assumptions about underlying libraries.

generate_ans(html, *args, **kwargs)[source]

Parses html and produces ANS in a jsonify-able format.

Parameters

html (str) – the html to parse

Returns

a list of ANS elements as dictionaries

Base HTML Parser

class html2ans.base.BaseHtmlAnsParser(ans_version=None, soup_parse_lib='lxml', suppress_exceptions=False, default_parsers=None, *args, **kwargs)[source]

Bases: html2ans.base.AbstractHtmlAnsParser, html2ans.parsers.utils.AbstractParserUtilities

The base root/top-level parser class; assumes elements will be parsed with BeautifulSoup. Use this class to generate a list of ANS elements from an HTML document. HTML elements within the document will be parsed using element parsers present in the parsers attribute. The parsers attribute is populated first with the parsers from the DEFAULT_PARSERS variable. Other parsers can be added using this class’s add_parser and insert_parser methods.

Parsing an element tries each parser in that element’s parser list, in order. If a parser isn’t applicable (is_applicable returns False) or parse indicates that the element wasn’t a match, the next parser is tried. Why are there two ways of indicating whether an element and a parser are compatible? You can’t always tell until you’re parsing the element. is_applicable catches early, obvious issues (like “this parser is for img elements; is this an img?”).

Setting ans_version won’t affect the output of these parsers in terms of actual ANS version compatibility; it is provided as a convenience (sometimes you need to update the overall ANS version because of a specific new feature but don’t want to update the output of all parsers).

Use the suppress_exceptions option to treat element parsing exceptions as non-matches. With suppress_exceptions, when exceptions are thrown, the next parser will be tried (as though is_applicable returned False).
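The fallback behaviour described above can be sketched in a few lines. This is an illustration only, not the library's implementation: elements are plain dicts rather than BeautifulSoup tags, parse returns a hypothetical (output, matched) tuple, and ImgParser, ExplodingParser and parse_element are made-up names.

```python
class ImgParser:
    """Hypothetical parser that only handles img elements."""

    def is_applicable(self, element):
        # Cheap, early check: is this an img at all?
        return element.get("tag") == "img"

    def parse(self, element):
        src = element.get("src")
        if not src:
            # Only discovered while parsing: an img without a source is no match.
            return None, False
        return {"type": "image", "url": src}, True


class ExplodingParser:
    """Hypothetical parser that always fails, to show suppress_exceptions."""

    def is_applicable(self, element):
        return True

    def parse(self, element):
        raise ValueError("cannot parse this element")


def parse_element(element, parsers, suppress_exceptions=False):
    """Try each parser in order and return the first matching output, else None."""
    for parser in parsers:
        if not parser.is_applicable(element):
            continue
        try:
            output, matched = parser.parse(element)
        except Exception:
            if suppress_exceptions:
                continue  # treat the exception as a non-match
            raise
        if matched:
            return output
    return None


result = parse_element(
    {"tag": "img", "src": "cat.png"},
    [ExplodingParser(), ImgParser()],
    suppress_exceptions=True,
)
```

With suppress_exceptions left at False, the same call would propagate the ValueError instead of moving on to ImgParser.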

Parameters
  • ans_version (str) – the ANS version to apply to the output of parsers that require a version

  • soup_parse_lib (str) – the BeautifulSoup parsing library to use

  • suppress_exceptions (bool) – whether or not to suppress exceptions during element parsing

  • default_parsers (list) – The default parsers to populate parsers with. Order matters here!

BACKUP_PARSERS = []

The backup parsers for this class to use. This list will be used on every parsing attempt if the primary parsers for a given type don’t match. For example, if parsers contains 'p': [ParagraphParser()] and an exception is thrown when processing a paragraph tag using the ParagraphParser, all backup parsers will also be tried. In the default implementation, this essentially means that producing raw_html will be the last resort when all other parsers fail.
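A minimal sketch of the backup mechanism, assuming exceptions are treated as non-matches. The classes below are simplified stand-ins that borrow the names ParagraphParser and RawHtmlParser for readability; they are not the library's classes, and elements are plain dicts with hypothetical tag, text and html keys.

```python
class ParagraphParser:
    """Stand-in: handles p elements and fails when no text is present."""

    def is_applicable(self, element):
        return element["tag"] == "p"

    def parse(self, element):
        text = element["text"]  # raises KeyError when the text is missing
        return {"type": "text", "content": text}, True


class RawHtmlParser:
    """Stand-in: accepts anything and wraps the markup as raw_html."""

    def is_applicable(self, element):
        return True

    def parse(self, element):
        return {"type": "raw_html", "content": element["html"]}, True


PARSERS = {"p": [ParagraphParser()]}   # primary parsers, keyed by tag
BACKUP_PARSERS = [RawHtmlParser()]     # tried after the primary parsers


def parse_with_backup(element):
    """Try the primary parsers for the tag, then every backup parser."""
    candidates = PARSERS.get(element["tag"], []) + BACKUP_PARSERS
    for parser in candidates:
        if not parser.is_applicable(element):
            continue
        try:
            output, matched = parser.parse(element)
        except Exception:
            continue  # assumption: behaves like suppress_exceptions=True
        if matched:
            return output
    return None


ok = parse_with_backup({"tag": "p", "text": "hi", "html": "<p>hi</p>"})
fallback = parse_with_backup({"tag": "p", "html": "<p></p>"})
```

Here ok comes from the paragraph stand-in, while fallback is produced by the raw_html stand-in because the primary parser raised.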

parsers = None

A mapping of potential HTML elements to a list of parsers to use to attempt to parse elements of that type

generate_ans(html, start_tag='body', *args, **kwargs)[source]

Parses html and produces ANS in a jsonify-able format.

Parameters
  • html (str) – the html to parse

  • start_tag (str) – where to start parsing (if not provided, all tags will be parsed)

Returns

a list of ANS elements as dictionaries

insert_parser(element_key, parser, position=None, *args, **kwargs)[source]

Insert a parser of the given type into the list of parsers for that type.

Parameters
  • element_key (str) – The key of the element in self.parsers (e.g. ‘p’)

  • parser (html2ans.parsers.base.BaseElementParser) – The parser object to insert

  • position (int) – Where to insert the parser in the list

add_parser(parser, *args, **kwargs)[source]

Add a parser to self.parsers using the parser’s applicable_elements.

Parameters

parser (html2ans.parsers.base.BaseElementParser) – The parser object to add
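The relationship between insert_parser and add_parser can be pictured with a small stand-in registry. This is a sketch under assumptions, not the library's code: ParserRegistry and FakeParser are hypothetical, and appending when position is None is a guess at the default behaviour.

```python
class FakeParser:
    """Hypothetical parser that only declares which elements it applies to."""

    def __init__(self, applicable_elements):
        self.applicable_elements = applicable_elements


class ParserRegistry:
    """Stand-in for the parsers mapping: element key -> ordered parser list."""

    def __init__(self):
        self.parsers = {}

    def insert_parser(self, element_key, parser, position=None):
        parser_list = self.parsers.setdefault(element_key, [])
        if position is None:
            parser_list.append(parser)  # assumption: no position means append
        else:
            parser_list.insert(position, parser)

    def add_parser(self, parser):
        # Register the parser under every element it claims to handle.
        for element_key in parser.applicable_elements:
            self.insert_parser(element_key, parser)


registry = ParserRegistry()
general = FakeParser(["p", "div"])
specific = FakeParser(["p"])

registry.add_parser(general)                       # registered for p and div
registry.insert_parser("p", specific, position=0)  # tried before general for p
```

Because parsers are tried in list order, inserting at position 0 gives the more specific parser the first chance at p elements.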

Default HTML Parser

class html2ans.default.DefaultHtmlAnsParser(*args, **kwargs)[source]

Bases: html2ans.base.BaseHtmlAnsParser

The default root/top-level parser class.

DEFAULT_PARSERS = [<html2ans.parsers.embeds.ArcPlayerEmbedParser object>, <html2ans.parsers.embeds.DailyMotionEmbedParser object>, <html2ans.parsers.embeds.FacebookPostEmbedParser object>, <html2ans.parsers.embeds.FacebookVideoEmbedParser object>, <html2ans.parsers.embeds.FlickrEmbedParser object>, <html2ans.parsers.embeds.ImgurEmbedParser object>, <html2ans.parsers.embeds.InstagramEmbedParser object>, <html2ans.parsers.embeds.PollDaddyEmbedParser object>, <html2ans.parsers.embeds.RedditEmbedParser object>, <html2ans.parsers.embeds.SpotifyEmbedParser object>, <html2ans.parsers.embeds.TumblrEmbedParser object>, <html2ans.parsers.embeds.TwitterTweetEmbedParser object>, <html2ans.parsers.embeds.TwitterVideoEmbedParser object>, <html2ans.parsers.embeds.YoutubeEmbedParser object>, <html2ans.parsers.embeds.VimeoEmbedParser object>, <html2ans.parsers.embeds.VineEmbedParser object>, <html2ans.parsers.text.HeaderParser object>, <html2ans.parsers.text.ListParser object>, <html2ans.parsers.text.FormattedTextParser object>, <html2ans.parsers.text.BlockquoteParser object>, <html2ans.parsers.text.ParagraphParser object>, <html2ans.parsers.image.LinkedImageParser object>, <html2ans.parsers.image.ImageParser object>, <html2ans.parsers.image.FigureParser object>, <html2ans.parsers.text.InterstitialLinkParser object>, <html2ans.parsers.audio.AudioParser object>, <html2ans.parsers.embeds.IFrameParser object>, <html2ans.parsers.base.NullParser object>]

Default parsers for the default implementation. These will be added to the BaseHtmlAnsParser parsers attribute in the order listed, so order matters!

BACKUP_PARSERS = [<html2ans.parsers.embeds.IFrameParser object>, <html2ans.parsers.raw_html.RawHtmlParser object>]

Backup parsers for the default implementation. These will be tried in the order listed, so order matters!