Base Parsers

Parsing Results

class html2ans.parsers.base.ParseResult

Bases: html2ans.parsers.base.ParseResult

A wrapper for holding the results of parsing.

output is the ANS JSON produced by the parser.

match indicates whether this parse matched the element. When it is True, no further parse attempts are made for that element.

The idea of the parsing “match” is necessary so that we can try multiple parsers per tag without trying more parsers than we have to. For example, when parsing <p></p>, the first available parser returns an empty dictionary. If that empty output were the only signal, the next logical step would be to try the next parser. By returning match=True in that situation, we make no more parsing attempts and move on to the next element in the tree.

Parameters

output (dict or list[dict]) – The output ANS
match (bool) – Whether or not this parse was a match for a given element
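
The match semantics described above can be sketched without the library itself. This is a minimal illustration, not html2ans code: ParseResult is modelled as a two-field namedtuple (an assumption), and parse_empty_paragraph, parse_as_raw_html, and first_match are hypothetical names invented for the example.

```python
from collections import namedtuple

# Stand-in for html2ans.parsers.base.ParseResult (assumed to behave like a
# two-field tuple of output and match).
ParseResult = namedtuple("ParseResult", ["output", "match"])


def parse_empty_paragraph(tag_name):
    # Hypothetical parser: recognises <p> but has no ANS to emit for it.
    if tag_name == "p":
        return ParseResult({}, True)
    return ParseResult({}, False)


def parse_as_raw_html(tag_name):
    # Hypothetical fallback parser that accepts any tag.
    # The output dictionary is illustrative only.
    markup = "<{0}></{0}>".format(tag_name)
    return ParseResult({"type": "raw_html", "content": markup}, True)


def first_match(tag_name, parsers):
    """Try each parser in order and stop at the first reported match."""
    for parser in parsers:
        result = parser(tag_name)
        if result.match:
            return result
    return ParseResult(None, False)


parsers = [parse_empty_paragraph, parse_as_raw_html]
p_result = first_match("p", parsers)      # empty output, but a match
div_result = first_match("div", parsers)  # falls through to the fallback
```

In this sketch the empty paragraph stops at the first parser even though its output is empty, which is the case the match flag exists to handle.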

Parser Interface

class html2ans.parsers.base.ElementParser

Bases: object

Element parsing interface.

is_applicable(element, *args, **kwargs)

Indicates if the given element is something this parser can/should be parsing.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to check for applicability

parse(element, *args, **kwargs)

Parses the given element.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
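
A parser written against this interface might look like the sketch below. It runs without html2ans or BeautifulSoup installed, so ElementParser, ParseResult, and FakeTag are local stand-ins (assumptions about shape, not the library's classes), and HorizontalRuleParser is a hypothetical example.

```python
from collections import namedtuple

# Stand-ins so the sketch runs without html2ans or BeautifulSoup installed.
ParseResult = namedtuple("ParseResult", ["output", "match"])
FakeTag = namedtuple("FakeTag", ["name"])


class ElementParser(object):
    """Assumed shape of the interface: two methods for subclasses to override."""

    def is_applicable(self, element, *args, **kwargs):
        raise NotImplementedError

    def parse(self, element, *args, **kwargs):
        raise NotImplementedError


class HorizontalRuleParser(ElementParser):
    """Hypothetical parser for <hr> tags."""

    def is_applicable(self, element, *args, **kwargs):
        # Only tags named "hr" are handled by this parser.
        return getattr(element, "name", None) == "hr"

    def parse(self, element, *args, **kwargs):
        # The output dictionary is illustrative, not a verified ANS element.
        return ParseResult({"type": "divider"}, True)


parser = HorizontalRuleParser()
hr = FakeTag("hr")
result = parser.parse(hr) if parser.is_applicable(hr) else None
```

The caller checks is_applicable first and only then calls parse, which mirrors the two-step contract the interface describes.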

Base Parser

class html2ans.parsers.base.BaseElementParser

Bases: html2ans.parsers.base.ElementParser, html2ans.parsers.utils.AbstractParserUtilities

Base element parser; assumes elements are being parsed using BeautifulSoup. Provides a standard method of checking for applicability via applicable_elements and applicable_classes.

version_required = False: Whether or not a version should be added to this parser’s output in the root document parser.

applicable_elements = []: The types of elements this parser should be used on. For example, applicable_elements = [Comment, 'br'] indicates that this parser is meant for Comment objects and br tags.

applicable_classes = []: The classes of elements this parser should be used on. This is an extra requirement on top of applicable_elements: if this list is populated, then an HTML tag must have an applicable name and an applicable class in order to be considered. For example, if applicable_elements = ['div'] and applicable_classes = ['headlines'], <div><img ...><p>My headline</p></div> would not be considered applicable but <div class="headlines"><img ...><p>My headline</p></div> would.
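
The name-plus-class rule can be sketched as a small standalone function. This is an approximation of the documented behaviour, not the library's implementation: FakeTag is a hypothetical stand-in for bs4.element.Tag, the free function is_applicable is invented for the example, and only tag names are handled (type entries such as Comment are left out).

```python
class FakeTag(object):
    """Minimal stand-in for bs4.element.Tag: a name plus an attrs dict."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}


def is_applicable(element, applicable_elements, applicable_classes):
    """Sketch of the rule: the name must match, and a class too if any are listed."""
    if element.name not in applicable_elements:
        return False
    if not applicable_classes:
        # No class restriction configured, so the name match is enough.
        return True
    # BeautifulSoup exposes the class attribute as a list of class names.
    element_classes = element.attrs.get("class", [])
    return any(cls in applicable_classes for cls in element_classes)


plain_div = FakeTag("div")
headline_div = FakeTag("div", {"class": ["headlines"]})

plain_ok = is_applicable(plain_div, ["div"], ["headlines"])
headline_ok = is_applicable(headline_div, ["div"], ["headlines"])
```

Here plain_ok is False and headline_ok is True, matching the two <div> examples in the attribute description.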

is_applicable(element, *args, **kwargs)

Checks applicability using applicable_elements and, optionally, applicable_classes.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to check for applicability

construct_output(element, ans_type=None, content=None, version=None, *args, **kwargs)

Convenience method for constructing an output dictionary. If element is a Tag with attributes, those attributes will be stashed in additional_properties.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element being parsed
ans_type (str) – the ANS type to put in the output type field
content (str) – the content to put in the content field
version (str) – the version to put in the version field. Note: if no version is provided but version_required=True on this parser, the output will receive a version from the root parser
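
As a rough picture of the dictionary this method builds, the sketch below assembles the four fields named above. It is not the library's code: it takes a plain attrs dictionary instead of an element, and leaving out fields that were not supplied is an assumption made for the example.

```python
def construct_output(attrs, ans_type=None, content=None, version=None):
    """Sketch of the output dictionary; attrs stands in for the tag's attributes."""
    output = {}
    # Assumption: fields that were not supplied are left out entirely.
    if ans_type is not None:
        output["type"] = ans_type
    if content is not None:
        output["content"] = content
    if version is not None:
        output["version"] = version
    if attrs:
        # Tag attributes are stashed rather than discarded.
        output["additional_properties"] = dict(attrs)
    return output


example = construct_output({"class": ["lede"]}, ans_type="text", content="Hello")
```

With these inputs, example holds the type, the content, and the original class attribute under additional_properties, with no version key.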

Ignore Elements (Null Parser)

class html2ans.parsers.base.NullParser

Bases: html2ans.parsers.base.BaseElementParser

Parser for elements we want to ignore in the output ANS.

is_applicable(element, *args, **kwargs)

Checks applicability using applicable_elements and, optionally, applicable_classes.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to check for applicability

parse(element, *args, **kwargs)

Parses the given element.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse