Base Parsers
Parsing Results
class html2ans.parsers.base.ParseResult
Bases: html2ans.parsers.base.ParseResult

A wrapper for holding the results of parsing.

output is the ANS JSON produced by the parser. match indicates whether or not other parse attempts should be made.

The idea of the parsing "match" is necessary so that we can try multiple parsers per tag (and not try multiple parsers when we don't have to). For example, when parsing <p></p>, if the first available parser only returned an empty dictionary, the next logical step would be to try the next parser. By returning match=True in that situation, we make no further parsing attempts and simply move on to the next element in the tree.

Parameters
output (dict or list[dict]) – The output ANS
match (bool) – Whether or not this parse was a match for a given element
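
The role of match can be illustrated with a minimal sketch. It does not import html2ans: the ParseResult below is a stand-in namedtuple with the two documented fields, and the parser functions and the parse_element loop are hypothetical names invented for this example, not the library's dispatch code.

```python
from collections import namedtuple

# Stand-in with the two documented fields (output, match); this is an
# assumption for illustration, not the library's own class.
ParseResult = namedtuple("ParseResult", ["output", "match"])


def parse_paragraph(tag_name, text):
    """Hypothetical parser that claims every <p>, even an empty one."""
    if tag_name != "p":
        return ParseResult(None, False)
    if not text.strip():
        # Nothing to emit, but the element is handled: stop trying parsers.
        return ParseResult({}, True)
    return ParseResult({"type": "text", "content": text}, True)


def parse_fallback(tag_name, text):
    """Hypothetical catch-all parser that wraps anything as raw HTML."""
    return ParseResult({"type": "raw_html", "content": text}, True)


def parse_element(tag_name, text, parsers):
    """Try each parser in order and stop at the first match."""
    for parser in parsers:
        result = parser(tag_name, text)
        if result.match:
            return result
    return ParseResult(None, False)


parsers = [parse_paragraph, parse_fallback]
empty = parse_element("p", "", parsers)
div = parse_element("div", "hello", parsers)
```

Here the empty paragraph yields an empty output with match=True, so the fallback parser is never consulted for it, while the div falls through to the fallback.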
Parser Interface
class html2ans.parsers.base.ElementParser
Bases: object

Element parsing interface.
Base Parser
class html2ans.parsers.base.BaseElementParser
Bases: html2ans.parsers.base.ElementParser, html2ans.parsers.utils.AbstractParserUtilities

Base element parser; assumes elements are being parsed using BeautifulSoup. Provides a standard method of checking for applicability via applicable_elements and applicable_classes.

version_required = False
Whether or not a version should be added to this parser's output in the root document parser.

applicable_elements = []
The types of elements this parser should be used on. For example, applicable_elements = [Comment, 'br'] indicates that this parser is meant for Comment objects and br tags.

applicable_classes = []
The classes of elements this parser should be used on. This is an extra requirement on top of applicable_elements: if this list is populated, then an HTML tag must have both an applicable name and an applicable class in order to be considered. For example, if applicable_elements = ['div'] and applicable_classes = ['headlines'], <div><img ...><p>My headline</p></div> would not be considered applicable but <div class="headlines"><img ...><p>My headline</p></div> would.

is_applicable(element, *args, **kwargs)
Checks applicability using applicable_elements and, optionally, applicable_classes.

Parameters
element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
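
The two-part applicability rule described above can be sketched without BeautifulSoup. FakeTag and HeadlineParserSketch are hypothetical names used only for this illustration; the real check lives in BaseElementParser.is_applicable and may differ in detail. The sketch handles tag names only, not type entries such as Comment.

```python
class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup Tag (a name plus attrs)."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}


class HeadlineParserSketch:
    """Illustrative only; does not subclass the real BaseElementParser."""

    applicable_elements = ["div"]
    applicable_classes = ["headlines"]

    def is_applicable(self, element):
        # Requirement 1: the tag name must be listed.
        if element.name not in self.applicable_elements:
            return False
        # Requirement 2 applies only when applicable_classes is populated.
        if not self.applicable_classes:
            return True
        # BeautifulSoup exposes the class attribute as a list of strings.
        element_classes = element.attrs.get("class", [])
        return any(cls in self.applicable_classes for cls in element_classes)


parser = HeadlineParserSketch()
plain_div = FakeTag("div")
headline_div = FakeTag("div", {"class": ["headlines"]})
headline_span = FakeTag("span", {"class": ["headlines"]})
```

With these settings only headline_div is applicable: plain_div has the right name but no matching class, and headline_span has the class but the wrong name.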

construct_output(element, ans_type=None, content=None, version=None, *args, **kwargs)
Convenience method for constructing an output dictionary. If element is a Tag with attributes, those attributes will be stashed in additional_properties.

Parameters
element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element being parsed
ans_type (str) – the ANS type to put in the output type field
content (str) – the content to put in the content field
version (str) – the version to put in the version field. Note: if not provided but version_required=True on this parser, the output will receive a version from the root parser
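
As a rough picture of the dictionary this method describes, the sketch below builds one from plain Python values. construct_output_sketch is a hypothetical free function, not the library method: it takes an attribute dict instead of a BeautifulSoup element, and leaving out fields that were not provided is an assumption made here rather than documented behavior.

```python
def construct_output_sketch(attrs, ans_type=None, content=None, version=None):
    """Build an ANS-style dict from the documented fields (illustrative)."""
    output = {}
    if ans_type is not None:
        output["type"] = ans_type
    if content is not None:
        output["content"] = content
    if version is not None:
        output["version"] = version
    if attrs:
        # Element attributes are stashed rather than discarded.
        output["additional_properties"] = dict(attrs)
    return output


paragraph = construct_output_sketch({"id": "intro"}, ans_type="text", content="Hello")
bare = construct_output_sketch({}, ans_type="text", content="Hello")
```

The first call keeps the id attribute under additional_properties; the second has no attributes, so that key is absent.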

Ignore Elements (Null Parser)
class html2ans.parsers.base.NullParser
Bases: html2ans.parsers.base.BaseElementParser

Parser for elements we want to ignore in the output ANS.