Parser Utilities
-
html2ans.parsers.utils.has_attributes(tag, filter_types=('id', 'class', 'style'))
Helper function to check whether a tag has any attributes other than those listed in filter_types.
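As a rough illustration of the check described for has_attributes, the following sketch works on a plain attribute dictionary rather than a real tag. The name has_attributes_sketch and the dict input are assumptions made for this example; it is not the library's implementation.

```python
def has_attributes_sketch(attrs, filter_types=("id", "class", "style")):
    """Return True if attrs has at least one key not listed in filter_types."""
    # Keep only the attribute names that are not being ignored.
    remaining = [name for name in attrs if name not in filter_types]
    return len(remaining) > 0


# An element carrying only ignored attributes counts as having none.
only_ignored = has_attributes_sketch({"id": "intro", "class": "lead"})
# An href is not in the default filter_types, so it counts.
with_href = has_attributes_sketch({"class": "lead", "href": "google.com"})
```

Passing a different filter_types tuple changes which attribute names are ignored.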
-
html2ans.parsers.utils.parse_dimensions(tag, tag_json, dimension_keys=('width', 'height'))
Adds dimensions to the converted JSON. Images and iframes generally have width/height properties; this is a convenience method for adding those properties.
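The idea behind parse_dimensions can be sketched with plain dictionaries. This is a minimal sketch under stated assumptions: the name parse_dimensions_sketch, the attrs dictionary standing in for a tag, the sample file name, and the choice to copy values unchanged as top-level keys are all hypothetical. Where the library actually places these values in the converted JSON is not stated here.

```python
def parse_dimensions_sketch(attrs, tag_json, dimension_keys=("width", "height")):
    """Copy any dimension attributes found in attrs into tag_json."""
    for key in dimension_keys:
        if key in attrs:
            # Assumption: values are copied unchanged as top-level keys.
            tag_json[key] = attrs[key]
    return tag_json


# Hypothetical image attributes and a partially converted JSON dict.
image_json = parse_dimensions_sketch(
    {"src": "photo.jpg", "width": "640", "height": "480"},
    {"type": "image"},
)
```

Attributes that are absent are simply skipped, so the JSON only gains the keys the tag actually carries.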
-
class html2ans.parsers.utils.AbstractParserUtilities
Bases: object
Common utility functions for parsers. These methods are grouped here (rather than in the BaseElementParser) because they are used both in element parsing and in document parsing (i.e. by the BaseHtmlAnsParser).
-
WRAPPER_TAGS = ['p', 'div']
Which tags to consider as potential wrappers in the is_wrapper method.
-
EMPTY_STRINGS = [None, '', ' ', '\n', '<br>', '<br/>']
List of strings considered empty (if a NavigableString is passed to is_empty and the string is in this list, is_empty will return True).
-
EMPTY_TAGS = ['br']
List of tags considered empty (if a tag passed to is_empty is in this list, is_empty will return True).
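The two membership checks described for EMPTY_STRINGS and EMPTY_TAGS can be sketched as follows. The helper names is_empty_string and is_empty_tag_name are hypothetical, and the sketch covers only the list lookups described above, not whatever additional checks is_empty may perform on a tag's contents.

```python
EMPTY_STRINGS = [None, "", " ", "\n", "<br>", "<br/>"]
EMPTY_TAGS = ["br"]


def is_empty_string(text):
    """True when the string (or None) appears verbatim in EMPTY_STRINGS."""
    return text in EMPTY_STRINGS


def is_empty_tag_name(tag_name):
    """True when the tag name appears in EMPTY_TAGS."""
    return tag_name in EMPTY_TAGS


newline_is_empty = is_empty_string("\n")
word_is_empty = is_empty_string("Hello")
br_is_empty = is_empty_tag_name("br")
```

Because both constants are class attributes, a subclass could presumably override them to change what counts as empty.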
-
TEXT_TAGS = ['a', 'b', 'del', 'em', 'i', 'ins', 'mark', 'small', 'strong', 'sub', 'sup', 'span', 'u', 'p', 'blockquote', 'li']
List of tags considered to be text. This affects the results of is_text_only, which is used by most text parsers. For example, because a tags are considered text by default, <p>Here is a <a href="google.com">link</a></p> would be considered text only.
-
classmethod is_empty(element, *args, **kwargs)
Returns True if the given tag is empty. Things like <p> </p> and <iframe /> are considered empty.
:param element: the tag to check
:return: True if empty
-
classmethod is_text_only(element, *args, **kwargs)
Returns True if the given tag only has NavigableString or tags in TEXT_TAGS for children.
:param element: the element to check
:return: True if this element only contains text
-
classmethod is_wrapper(element, *args, **kwargs)
Returns True if this tag is only wrapping other content.
:param element: the element to check
:return: True if the given element is only wrapping sub-content
-
classmethod get_children(element, filter_tags=None, filter_types=None)
Returns the given tag’s children (excluding the filter_types and filter_tags).
:param element: the element to check
:param filter_types: class types to filter from the tag’s children
:param filter_tags: tag names to filter from the tag’s children
:return: the unfiltered, non-empty children if element is a Tag, else []
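The filtering described for get_children can be sketched without BeautifulSoup by using small stand-in classes. Everything named here (Node, Comment, get_children_sketch) is hypothetical, and treating whitespace-only strings as empty is an assumption of the sketch rather than documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed tag: a name plus child nodes/strings."""
    name: str
    children: list = field(default_factory=list)


class Comment(str):
    """Hypothetical string subclass marking comment text."""


def get_children_sketch(element, filter_tags=None, filter_types=None):
    """Return element's children minus filtered tags, filtered types, and blanks."""
    if not isinstance(element, Node):
        return []
    filter_tags = tuple(filter_tags or ())
    filter_types = tuple(filter_types or ())
    kept = []
    for child in element.children:
        if filter_types and isinstance(child, filter_types):
            continue  # child's class is in filter_types
        if isinstance(child, Node) and child.name in filter_tags:
            continue  # child's tag name is in filter_tags
        if isinstance(child, str) and not child.strip():
            continue  # assumption: whitespace-only strings count as empty
        kept.append(child)
    return kept


div = Node("div", [
    "\n",
    Comment("editor note"),
    Node("script"),
    Node("p", ["Hello"]),
])
children = get_children_sketch(div, filter_tags=["script"], filter_types=[Comment])
names = [child.name for child in children]
```

With both filters applied, only the p node survives; passing a non-Node value returns an empty list, mirroring the "else []" in the return description.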
-