Base Parsers
Parsing Results
class html2ans.parsers.base.ParseResult
Bases: html2ans.parsers.base.ParseResult
A wrapper for holding the results of parsing. output is the ANS JSON produced by the parser. match indicates whether or not other parse attempts should be made.
The idea of the parsing “match” is necessary so that we can try multiple parsers per tag (and not try multiple parsers when we don’t have to). For example, when parsing <p></p>, if the first available parser only returned an empty dictionary, the next logical step would be to try the next parser. By returning match=True in that situation, we make no further parsing attempts and simply move on to the next element in the tree.
Parameters
output (dict or list[dict]) – The output ANS
match (bool) – Whether or not this parse was a match for a given element
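The match logic described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: the namedtuple is a stand-in assumed to mirror the two documented fields, and empty_paragraph_parser, fallback_parser, and parse_element are hypothetical names that operate on plain strings rather than BeautifulSoup elements.

```python
from collections import namedtuple

# Stand-in assumed to mirror the documented fields (output, match);
# it is not imported from html2ans.
ParseResult = namedtuple("ParseResult", ["output", "match"])


def empty_paragraph_parser(element):
    """Hypothetical parser: claims empty <p></p> elements and emits nothing."""
    if element == "<p></p>":
        return ParseResult({}, True)  # matched, so no further parsers are tried
    return ParseResult(None, False)  # not a match, so the next parser gets a turn


def fallback_parser(element):
    """Hypothetical catch-all parser that wraps anything it is given."""
    return ParseResult({"type": "raw_html", "content": element}, True)


def parse_element(element, parsers):
    """Try each parser in order and stop at the first one reporting a match."""
    for parser in parsers:
        result = parser(element)
        if result.match:
            return result
    return ParseResult(None, False)


parsers = [empty_paragraph_parser, fallback_parser]
empty = parse_element("<p></p>", parsers)  # stops at the first parser
other = parse_element("<hr>", parsers)     # falls through to the second parser
```

Without the match flag, the empty dictionary returned for <p></p> would be indistinguishable from "this parser did not apply", and the fallback would run unnecessarily.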
Parser Interface
class html2ans.parsers.base.ElementParser
Bases: object
Element parsing interface.
Base Parser
class html2ans.parsers.base.BaseElementParser
Bases: html2ans.parsers.base.ElementParser, html2ans.parsers.utils.AbstractParserUtilities
Base element parser; assumes elements are being parsed using BeautifulSoup. Provides a standard method of checking for applicability via applicable_elements and applicable_classes.
version_required = False
Whether or not a version should be added to this parser’s output in the root document parser.
applicable_elements = []
The types of elements this parser should be used on. For example, applicable_elements = [Comment, 'br'] indicates that this parser is meant for Comment objects and br tags.
applicable_classes = []
The classes of elements this parser should be used on. This is an extra requirement on top of applicable_elements: if this list is populated, then an HTML tag must have an applicable name and an applicable class in order to be considered. For example, if applicable_elements = ['div'] and applicable_classes = ['headlines'], <div><img ...><p>My headline</p></div> would not be considered applicable but <div class="headlines"><img ...><p>My headline</p></div> would.
is_applicable(element, *args, **kwargs)
Checks applicability using applicable_elements and, optionally, applicable_classes.
Parameters
element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
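As an illustration of the name-plus-class rule, here is a minimal sketch. It assumes a simplified element with a name and an attrs dictionary in which class is a list of strings (the way BeautifulSoup exposes multi-valued attributes). FakeTag and HeadlineParserSketch are hypothetical, the check only covers tag-name strings (not types such as Comment), and the real is_applicable may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup Tag: a name plus attributes."""
    name: str
    attrs: dict = field(default_factory=dict)


class HeadlineParserSketch:
    """Illustrative only; not a subclass of the real BaseElementParser."""
    applicable_elements = ["div"]
    applicable_classes = ["headlines"]

    def is_applicable(self, element, *args, **kwargs):
        # Requirement 1: the tag name must be listed in applicable_elements.
        if element.name not in self.applicable_elements:
            return False
        # Requirement 2 (only when populated): at least one of the element's
        # classes must be listed in applicable_classes.
        if self.applicable_classes:
            classes = element.attrs.get("class", [])
            return any(c in self.applicable_classes for c in classes)
        return True


parser = HeadlineParserSketch()
plain_div = FakeTag("div")
headline_div = FakeTag("div", {"class": ["headlines"]})
headline_p = FakeTag("p", {"class": ["headlines"]})
```

Under these assumptions, only headline_div passes the check: plain_div has the right name but no applicable class, and headline_p has the class but the wrong name.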
construct_output(element, ans_type=None, content=None, version=None, *args, **kwargs)
Convenience method for constructing an output dictionary. If element is a Tag with attributes, those attributes will be stashed in additional_properties.
Parameters
element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element being parsed
ans_type (str) – the ANS type to put in the output type field
content (str) – the content to put in the content field
version (str) – the version to put in the version field. Note: if not provided but version_required=True on this parser, the output will receive a version from the root parser
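The shape of the resulting dictionary might look something like the sketch below. This is an assumption-laden illustration rather than the library's implementation: construct_output_sketch is a hypothetical function, it takes a plain attribute dictionary instead of a Tag, and leaving out fields that were not supplied is a guess about behavior the description above does not specify.

```python
def construct_output_sketch(attrs, ans_type=None, content=None, version=None):
    """Build an ANS-like dict; attrs stands in for a Tag's attribute dict."""
    output = {}
    # Assumption: fields that were not supplied are omitted from the output.
    if ans_type is not None:
        output["type"] = ans_type
    if content is not None:
        output["content"] = content
    if version is not None:
        output["version"] = version
    if attrs:
        # Stash the element's attributes, as the description above states.
        output["additional_properties"] = dict(attrs)
    return output


result = construct_output_sketch(
    {"class": ["lead"]}, ans_type="text", content="Hello"
)
```

In this sketch no version key is written when version is omitted, which leaves room for a root parser to supply one later when version_required=True.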
Ignore Elements (Null Parser)
class html2ans.parsers.base.NullParser
Bases: html2ans.parsers.base.BaseElementParser
Parser for elements we want to ignore in the output ANS.