Base Parsers

Parsing Results

class html2ans.parsers.base.ParseResult

Bases: html2ans.parsers.base.ParseResult

A wrapper for holding the results of parsing.

output is the ANS JSON produced by the parser.

match indicates whether or not other parse attempts should be made.

The idea of a parsing “match” lets us try multiple parsers per tag without trying more parsers than we have to. For example, when parsing <p></p>, if the first available parser simply returned an empty dictionary, the next logical step would be to try the next parser. By returning match=True in that situation, we make no further parsing attempts and move on to the next element in the tree.

Parameters
  • output (dict or list[dict]) – The output ANS

  • match (bool) – Whether or not this parse was a match for a given element
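
The role of match can be illustrated with a small dispatch loop. This is a minimal sketch, not the library's actual implementation: the ParseResult namedtuple below is a stand-in that carries only the two documented fields, and paragraph_parser, fallback_parser, and parse_with_first_match are hypothetical names that work on plain strings rather than BeautifulSoup elements.

```python
from collections import namedtuple

# Stand-in for html2ans.parsers.base.ParseResult (assumed: just these two fields).
ParseResult = namedtuple("ParseResult", ["output", "match"])


def paragraph_parser(html):
    """Hypothetical parser: claims every <p> tag, even an empty one."""
    if not html.startswith("<p>"):
        return ParseResult(None, False)
    text = html[len("<p>"):-len("</p>")]
    if not text:
        # Empty paragraph: nothing to emit, but no other parser should try it.
        return ParseResult({}, True)
    return ParseResult({"type": "text", "content": text}, True)


def fallback_parser(html):
    """Hypothetical catch-all parser that wraps anything as raw HTML."""
    return ParseResult({"type": "raw_html", "content": html}, True)


def parse_with_first_match(html, parsers):
    """Try parsers in order and stop at the first one that reports a match."""
    for parser in parsers:
        result = parser(html)
        if result.match:
            return result
    return ParseResult(None, False)


parsers = [paragraph_parser, fallback_parser]
empty = parse_with_first_match("<p></p>", parsers)  # claimed by paragraph_parser
other = parse_with_first_match("<hr>", parsers)     # falls through to fallback_parser
```

Without the match flag, the empty dictionary returned for <p></p> would be indistinguishable from "this parser did not apply", and the fallback would wrap the empty paragraph as raw HTML.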

Parser Interface

class html2ans.parsers.base.ElementParser

Bases: object

Element parsing interface.

is_applicable(element, *args, **kwargs)

Indicates if the given element is something this parser can/should be parsing.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to check for applicability

parse(element, *args, **kwargs)

Parses the given element.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
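
A parser that follows this interface might look like the sketch below. Everything here is a stand-in so the example runs without html2ans or BeautifulSoup installed: the local ElementParser and ParseResult only mirror the documented names, FakeTag is a hypothetical substitute for bs4.element.Tag, and HorizontalRuleParser is an invented example. It also assumes that parse returns a ParseResult, which the interface documentation does not state explicitly.

```python
from collections import namedtuple

# Stand-ins for the documented classes (assumptions, not the library's code).
ParseResult = namedtuple("ParseResult", ["output", "match"])
FakeTag = namedtuple("FakeTag", ["name", "text"])  # hypothetical substitute for bs4.element.Tag


class ElementParser(object):
    """Stand-in for the interface: subclasses supply both methods."""

    def is_applicable(self, element, *args, **kwargs):
        raise NotImplementedError

    def parse(self, element, *args, **kwargs):
        raise NotImplementedError


class HorizontalRuleParser(ElementParser):
    """Hypothetical parser that turns <hr> into a divider element."""

    def is_applicable(self, element, *args, **kwargs):
        # Only tags carry a name; strings and comments fall through to None.
        return getattr(element, "name", None) == "hr"

    def parse(self, element, *args, **kwargs):
        return ParseResult({"type": "divider"}, True)


parser = HorizontalRuleParser()
element = FakeTag(name="hr", text="")
result = parser.parse(element) if parser.is_applicable(element) else None
```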

Base Parser

class html2ans.parsers.base.BaseElementParser

Bases: html2ans.parsers.base.ElementParser, html2ans.parsers.utils.AbstractParserUtilities

Base element parser; assumes elements are being parsed using BeautifulSoup. Provides a standard method of checking for applicability via applicable_elements and applicable_classes.

version_required = False

Whether the root document parser should add a version to this parser’s output.

applicable_elements = []

The types of elements this parser should be used on. For example, applicable_elements = [Comment, 'br'] indicates that this parser is meant for Comment objects and br tags.

applicable_classes = []

The classes of elements this parser should be used on. This is an extra requirement on top of applicable_elements: if this list is populated, an HTML tag must have both an applicable name and an applicable class in order to be considered. For example, if applicable_elements = ['div'] and applicable_classes = ['headlines'], <div><img ...><p>My headline</p></div> would not be considered applicable but <div class="headlines"><img ...><p>My headline</p></div> would.

is_applicable(element, *args, **kwargs)

Checks applicability using applicable_elements and, optionally, applicable_classes.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to check for applicability
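
The two-stage check described by applicable_elements and applicable_classes could be sketched as follows. This is one plausible reading of the documented behavior, not the library's source: FakeTag and FakeComment are hypothetical substitutes for the BeautifulSoup types, and the standalone is_applicable function takes the two lists as arguments instead of reading them from a parser instance. It assumes, as BeautifulSoup does, that an element's classes are stored as a list under attrs["class"].

```python
class FakeTag(object):
    """Hypothetical substitute for bs4.element.Tag: a name plus an attrs dict."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}


class FakeComment(str):
    """Hypothetical substitute for bs4.element.Comment."""


def is_applicable(element, applicable_elements, applicable_classes):
    """Return True if element matches by name or type and, when required, by class."""
    name = getattr(element, "name", None)
    # Entries may be types (matched with isinstance) or tag names (matched by string).
    matches_element = any(
        isinstance(element, candidate) if isinstance(candidate, type) else candidate == name
        for candidate in applicable_elements
    )
    if not matches_element:
        return False
    if not applicable_classes:
        # No class restriction configured, so the element match is enough.
        return True
    element_classes = getattr(element, "attrs", {}).get("class", [])
    return any(cls in applicable_classes for cls in element_classes)


plain_div = FakeTag("div")
headline_div = FakeTag("div", {"class": ["headlines"]})
comment = FakeComment("ignore me")
```

With applicable_elements = ['div'] and applicable_classes = ['headlines'], only headline_div passes; with applicable_elements = [FakeComment, 'br'] and no classes, the comment passes and both divs fail.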

construct_output(element, ans_type=None, content=None, version=None, *args, **kwargs)

Convenience method for constructing an output dictionary. If element is a Tag with attributes, those attributes will be stashed in additional_properties.

Parameters
  • element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element being parsed

  • ans_type (str) – the ANS type to put in the output type field

  • content (str) – the content to put in the content field

  • version (str) – the version to put in the version field. Note: if not provided but version_required=True on this parser, the output will receive a version from the root parser
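
As a rough illustration of the dictionary this method builds, consider the sketch below. The key names type, content, version, and additional_properties come from the descriptions above; leaving out fields that were not supplied, and copying the attributes rather than referencing them, are assumptions. FakeTag is a hypothetical substitute for bs4.element.Tag, and the function is written standalone rather than as a parser method.

```python
class FakeTag(object):
    """Hypothetical substitute for bs4.element.Tag: a name plus an attrs dict."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}


def construct_output(element, ans_type=None, content=None, version=None):
    """Build an ANS-style dict, skipping fields that were not supplied (assumed)."""
    output = {}
    if ans_type is not None:
        output["type"] = ans_type
    if content is not None:
        output["content"] = content
    if version is not None:
        output["version"] = version
    # Stash any HTML attributes (assumed to live on element.attrs, as in BeautifulSoup).
    attrs = getattr(element, "attrs", None)
    if attrs:
        output["additional_properties"] = dict(attrs)
    return output


tag = FakeTag("p", {"id": "lede"})
output = construct_output(tag, ans_type="text", content="My headline")
```

Here output holds the type, the content, and the tag's id attribute under additional_properties; no version key is present because none was passed.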

Ignore Elements (Null Parser)

class html2ans.parsers.base.NullParser

Bases: html2ans.parsers.base.BaseElementParser

Parser for elements we want to ignore in the output ANS.

is_applicable(element, *args, **kwargs)

Checks applicability using applicable_elements and, optionally, applicable_classes.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to check for applicability

parse(element, *args, **kwargs)

Parses the given element.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
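
One way to picture a null-style parser is as a parser that claims an element and emits nothing for it. The sketch below is an assumption about that behavior, not the library's code: IgnoreParser, its applicable_elements list, and the collect helper are all hypothetical, elements are represented as plain tag-name strings for brevity, and ParseResult is a local stand-in.

```python
from collections import namedtuple

# Stand-in for html2ans.parsers.base.ParseResult (assumed: just these two fields).
ParseResult = namedtuple("ParseResult", ["output", "match"])


class IgnoreParser(object):
    """Hypothetical null-style parser: matches listed tags and emits no ANS."""

    applicable_elements = ["script", "style"]  # illustrative choice, not the library default

    def is_applicable(self, element_name):
        return element_name in self.applicable_elements

    def parse(self, element_name):
        # No output, but match=True so no other parser is tried for this element.
        return ParseResult(None, True)


def collect(element_names, parser):
    """Gather output for a flat list of tag names, dropping ignored ones."""
    collected = []
    for name in element_names:
        if parser.is_applicable(name):
            result = parser.parse(name)
            if result.output is not None:
                collected.append(result.output)
        else:
            # Hypothetical fallback for anything the parser does not claim.
            collected.append({"type": "raw_html", "content": "<%s>" % name})
    return collected


ans = collect(["script", "hr"], IgnoreParser())
```

The script tag is matched and dropped, so only the <hr> fallback entry ends up in ans.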