Parser Utilities
-
html2ans.parsers.utils.has_attributes(tag, filter_types=('id', 'class', 'style'))
Helper function that checks whether a tag has any attributes other than those listed in filter_types.
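The check described above can be illustrated with a small sketch. It is not the library's implementation: the name has_attributes_sketch is hypothetical, and a plain dict stands in for the tag's attribute mapping.

```python
def has_attributes_sketch(attrs, filter_types=("id", "class", "style")):
    """Return True if any attribute name falls outside filter_types.

    `attrs` is a plain dict standing in for a tag's attributes (an
    assumption made for illustration only).
    """
    return any(name not in filter_types for name in attrs)


# Only filtered attributes are present, so the tag counts as attribute-free.
only_filtered = has_attributes_sketch({"class": "lead", "id": "intro"})

# An attribute outside the filter list (href) counts.
with_href = has_attributes_sketch({"class": "lead", "href": "google.com"})
```

Under these assumptions, only_filtered is False and with_href is True.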
-
html2ans.parsers.utils.parse_dimensions(tag, tag_json, dimension_keys=('width', 'height'))
Adds dimensions to the converted JSON. Images and iframes will generally have width/height properties; this is just a convenience function for adding those properties.
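A minimal sketch of the idea follows. The name parse_dimensions_sketch, the dict standing in for the tag's attributes, and the choice to copy values unchanged into top-level keys are all assumptions; where the library actually places the values in the output JSON is not stated here.

```python
def parse_dimensions_sketch(attrs, tag_json, dimension_keys=("width", "height")):
    """Copy any dimension attributes found in `attrs` into `tag_json` in place.

    `attrs` is a plain dict standing in for a tag's attributes (hypothetical).
    """
    for key in dimension_keys:
        if key in attrs:
            # Assumption: values are copied as-is to top-level keys.
            tag_json[key] = attrs[key]
    return tag_json


# Illustrative converted JSON for an image; the field values are made up.
image_json = {"type": "image", "url": "https://example.com/cat.png"}
parse_dimensions_sketch(
    {"src": "https://example.com/cat.png", "width": "640", "height": "480"},
    image_json,
)
```

Keys missing from the attributes are simply skipped, so an iframe with only a width would gain only that key.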
-
class html2ans.parsers.utils.AbstractParserUtilities
Bases: object
Common utility functions for parsers. These methods are grouped here (rather than in the BaseElementParser) because they are used both in element parsing and in document parsing (i.e. by the BaseHtmlAnsParser).
-
WRAPPER_TAGS = ['p', 'div']
Which tags to consider as potential wrappers in the is_wrapper method.
-
EMPTY_STRINGS = [None, '', ' ', '\n', '<br>', '<br/>']
List of strings considered empty (if a NavigableString is passed to is_empty and the string is in this list, is_empty will return True).
-
EMPTY_TAGS = ['br']
List of tags considered empty (if a tag passed to is_empty is in this list, is_empty will return True).
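The two list lookups described for EMPTY_STRINGS and EMPTY_TAGS can be sketched as below. This is an illustration, not the library's code: is_empty_sketch is a hypothetical name, text nodes are modelled as plain strings (or None), and tags as (name, children) tuples. The real is_empty also treats cases such as a paragraph containing only whitespace as empty, which this sketch does not attempt.

```python
EMPTY_STRINGS = [None, "", " ", "\n", "<br>", "<br/>"]
EMPTY_TAGS = ["br"]


def is_empty_sketch(element):
    """Apply only the list-based emptiness rules.

    `element` is None or a str (standing in for a text node), or a
    (tag_name, children) tuple standing in for a tag. Both stand-ins are
    assumptions made for illustration.
    """
    if element is None or isinstance(element, str):
        return element in EMPTY_STRINGS
    tag_name, _children = element
    return tag_name in EMPTY_TAGS


newline_is_empty = is_empty_sketch("\n")
br_is_empty = is_empty_sketch(("br", []))
text_is_empty = is_empty_sketch("Hello")
```

Overriding either list in a subclass would change which strings or tags the lookups accept.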
-
TEXT_TAGS = ['a', 'b', 'del', 'em', 'i', 'ins', 'mark', 'small', 'strong', 'sub', 'sup', 'span', 'u', 'p', 'blockquote', 'li']
List of tags considered to be text. This affects the results of is_text_only, which is used by most text parsers. For example, because a tags are considered text by default, <p>Here is a <a href="google.com">link</a></p> would be considered text only.
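The paragraph-with-link example can be traced with a small sketch. The Node class and is_text_only_sketch are hypothetical stand-ins, not the library's types; the sketch also assumes only direct children are inspected, which the description does not confirm.

```python
from dataclasses import dataclass, field
from typing import List, Union

TEXT_TAGS = ["a", "b", "del", "em", "i", "ins", "mark", "small", "strong",
             "sub", "sup", "span", "u", "p", "blockquote", "li"]


@dataclass
class Node:
    """Hypothetical stand-in for a parsed tag: a name plus child nodes/strings."""
    name: str
    children: List[Union["Node", str]] = field(default_factory=list)


def is_text_only_sketch(element):
    """True if every direct child is a plain string or a Node in TEXT_TAGS."""
    return all(
        isinstance(child, str) or child.name in TEXT_TAGS
        for child in element.children
    )


# <p>Here is a <a href="google.com">link</a></p>
paragraph_with_link = Node("p", ["Here is a ", Node("a", ["link"])])

# <p>Caption <img/></p> : img is not in TEXT_TAGS
paragraph_with_image = Node("p", ["Caption ", Node("img")])

link_is_text = is_text_only_sketch(paragraph_with_link)
image_is_text = is_text_only_sketch(paragraph_with_image)
```

With the default list, the first paragraph is text only and the second is not; removing 'a' from TEXT_TAGS would flip the first result.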
-
classmethod is_empty(element, *args, **kwargs)
Returns True if the given tag is empty. Things like <p> </p> and <iframe /> are considered empty.
:param element: the tag to check
:return: True if empty
-
classmethod is_text_only(element, *args, **kwargs)
Returns True if the given tag only has NavigableString or tags in TEXT_TAGS for children.
:param element: the element to check
:return: True if this element only contains text
-
classmethod is_wrapper(element, *args, **kwargs)
Returns True if this tag is only wrapping other content.
:param element: the element to check
:return: True if the given element is only wrapping sub-content
-
classmethod get_children(element, filter_tags=None, filter_types=None)
Returns the given tag's children (excluding the filter_types and filter_tags).
:param element: the element to check
:param filter_types: class types to filter from the tag's children
:param filter_tags: tag names to filter from the tag's children
:return: the non-filtered, non-empty children if element is a Tag, else []
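The filtering described above might look roughly like the sketch below. Node and Comment are hypothetical stand-ins for parsed node types, get_children_sketch is not the library's method, and treating whitespace-only strings as the "empty" children to drop is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed tag."""
    name: str
    children: list = field(default_factory=list)


class Comment(str):
    """Hypothetical stand-in for a node type that a caller may filter out."""


def get_children_sketch(element, filter_tags=None, filter_types=None):
    """Return the children that survive filtering, or [] for non-tag input."""
    if not isinstance(element, Node):
        return []
    filter_tags = tuple(filter_tags or ())
    filter_types = tuple(filter_types or ())
    kept = []
    for child in element.children:
        if isinstance(child, filter_types):
            continue  # filtered by class type
        if isinstance(child, Node) and child.name in filter_tags:
            continue  # filtered by tag name
        if isinstance(child, str) and not child.strip():
            continue  # assumption: whitespace-only strings count as empty
        kept.append(child)
    return kept


div = Node("div", ["  ", Node("p", ["hi"]), Comment("note"), Node("script"), "text"])
children = get_children_sketch(div, filter_tags=["script"], filter_types=[Comment])
```

Here children keeps only the p node and the trailing string; passing a bare string instead of a Node returns an empty list.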
-