Parser Utilities

html2ans.parsers.utils.has_attributes(tag, filter_types=('id', 'class', 'style'))

Helper function to check if a tag has attributes (excluding the given filter_types).
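The check described above can be sketched in a few lines. This is only an illustration of the documented behavior, not the library's implementation: has_attributes_sketch is a hypothetical name, and a plain dict stands in for a tag's attribute mapping.

```python
def has_attributes_sketch(attrs, filter_types=("id", "class", "style")):
    """Return True if any attribute name falls outside filter_types.

    `attrs` is a plain dict standing in for a tag's attributes
    (an assumption made so the sketch runs without extra packages).
    """
    return any(name not in filter_types for name in attrs)


# Only filtered attributes are present, so nothing "interesting" remains.
print(has_attributes_sketch({"class": "lead", "id": "intro"}))  # False

# href is not in filter_types, so the tag counts as having attributes.
print(has_attributes_sketch({"class": "lead", "href": "google.com"}))  # True
```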

html2ans.parsers.utils.parse_dimensions(tag, tag_json, dimension_keys=('width', 'height'))

Adds dimensions to the converted JSON. Images and iframes generally have width/height properties; this convenience function adds those properties to the output.
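A rough sketch of the idea follows. It assumes the dimensions are copied unchanged onto top-level keys of the output dict; the library may place or convert them differently. parse_dimensions_sketch and the sample image_json are hypothetical, and a plain dict stands in for the tag's attributes.

```python
def parse_dimensions_sketch(attrs, tag_json, dimension_keys=("width", "height")):
    """Copy any dimension attributes found in `attrs` into `tag_json`.

    Assumption: values are copied as-is onto top-level keys.
    """
    for key in dimension_keys:
        if key in attrs:
            tag_json[key] = attrs[key]
    return tag_json


# Hypothetical converted output for an <img> tag, before dimensions are added.
image_json = {"type": "image", "url": "photo.png"}
parse_dimensions_sketch(
    {"src": "photo.png", "width": "640", "height": "480"}, image_json
)
print(image_json)
# {'type': 'image', 'url': 'photo.png', 'width': '640', 'height': '480'}
```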

class html2ans.parsers.utils.AbstractParserUtilities

Bases: object

Common utility functions for parsers. These methods are grouped here (rather than in the BaseElementParser) because they are used both in element parsing and in document parsing (i.e. by the BaseHtmlAnsParser).

WRAPPER_TAGS = ['p', 'div']

Which tags to consider as potential wrappers in the is_wrapper method.

EMPTY_STRINGS = [None, '', ' ', '\n', '<br>', '<br/>']

List of strings considered empty (if a NavigableString is passed to is_empty and the string is in this list, is_empty will return True).

EMPTY_TAGS = ['br']

List of tags considered empty (if a tag passed to is_empty is in this list, is_empty will return True).

TEXT_TAGS = ['a', 'b', 'del', 'em', 'i', 'ins', 'mark', 'small', 'strong', 'sub', 'sup', 'span', 'u', 'p', 'blockquote', 'li']

List of tags considered to be text. This affects the results of is_text_only, which is used by most text parsers. For example, because <a> tags are considered text by default, <p>Here is a <a href="google.com">link</a></p> would be considered text only.
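The effect of TEXT_TAGS on the example above can be illustrated with a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for the real parse tree, and is_text_only_sketch is a hypothetical function that only approximates the documented behavior by looking at direct children.

```python
import xml.etree.ElementTree as ET

TEXT_TAGS = ["a", "b", "del", "em", "i", "ins", "mark", "small", "strong",
             "sub", "sup", "span", "u", "p", "blockquote", "li"]


def is_text_only_sketch(element, text_tags=TEXT_TAGS):
    """Return True if every direct child element is a text tag."""
    # Iterating an ElementTree element yields its direct child elements;
    # plain text lives in .text/.tail, so it never fails the check.
    return all(child.tag in text_tags for child in element)


paragraph = ET.fromstring('<p>Here is a <a href="google.com">link</a></p>')
figure = ET.fromstring('<p><img src="photo.png" /></p>')

print(is_text_only_sketch(paragraph))  # True: <a> is in TEXT_TAGS
print(is_text_only_sketch(figure))     # False: <img> is not in TEXT_TAGS
```

Removing 'a' from the list passed as text_tags would make the first paragraph stop counting as text only, which is the kind of change the class attribute is meant to control.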

classmethod is_empty(element, *args, **kwargs)

Returns True if the given element is empty. Things like <br>, <p> </p>, and <iframe /> are considered empty.

:param element: the tag to check
:return: True if empty
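One way such an emptiness check might work is sketched below. The rule used here (a tag is empty if its name is in EMPTY_TAGS, or if it has no attributes and contains nothing but empty strings and empty tags) is an assumption inferred from the description, not the library's actual logic. is_empty_sketch is a hypothetical name, and xml.etree.ElementTree stands in for the real parse tree.

```python
import xml.etree.ElementTree as ET

EMPTY_STRINGS = [None, "", " ", "\n", "<br>", "<br/>"]
EMPTY_TAGS = ["br"]


def is_empty_sketch(element):
    """Return True if a string or element carries no content (assumed rule)."""
    # Strings (and None) are empty when listed in EMPTY_STRINGS.
    if element is None or isinstance(element, str):
        return element in EMPTY_STRINGS
    # Tags such as <br> are always treated as empty.
    if element.tag in EMPTY_TAGS:
        return True
    # Assumption: any attribute (e.g. an iframe's src) makes a tag non-empty.
    if element.attrib:
        return False
    if element.text not in EMPTY_STRINGS:
        return False
    # Every child, and the text that follows it, must also be empty.
    return all(
        is_empty_sketch(child) and child.tail in EMPTY_STRINGS
        for child in element
    )


print(is_empty_sketch(ET.fromstring("<p> </p>")))                # True
print(is_empty_sketch(ET.fromstring("<iframe />")))              # True
print(is_empty_sketch(ET.fromstring("<p>Hello</p>")))            # False
print(is_empty_sketch(ET.fromstring('<iframe src="a.html" />'))) # False
```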

classmethod is_text_only(element, *args, **kwargs)

Returns True if the given element has only NavigableString objects or tags in TEXT_TAGS as children.

:param element: the element to check
:return: True if this element only contains text

classmethod is_wrapper(element, *args, **kwargs)

Returns True if this tag is only wrapping other content.

:param element: the element to check
:return: True if the given element is only wrapping sub-content

classmethod get_children(element, filter_tags=None, filter_types=None)

Returns the given tag's children, excluding any that match filter_types or filter_tags.

:param element: the element to check
:param filter_tags: tag names to filter from the tag's children
:param filter_types: class types to filter from the tag's children
:return: the non-empty children that were not filtered out if element is a Tag, else []
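The filtering described here can be sketched with a toy tree. Everything in the block is a stand-in: Node and Comment are hypothetical classes used in place of the real tag and comment types, get_children_sketch is a hypothetical name, and treating only blank strings as empty children is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed tag: a name plus mixed children."""
    name: str
    children: list = field(default_factory=list)


class Comment(str):
    """Hypothetical stand-in for a comment node type."""


EMPTY_STRINGS = [None, "", " ", "\n", "<br>", "<br/>"]


def get_children_sketch(element, filter_tags=None, filter_types=None):
    """Return non-empty children not excluded by tag name or class type."""
    if not isinstance(element, Node):
        return []  # anything that is not a tag has no children
    filter_tags = tuple(filter_tags or ())
    filter_types = tuple(filter_types or ())
    kept = []
    for child in element.children:
        if isinstance(child, filter_types):
            continue  # excluded by class type, e.g. comments
        if isinstance(child, Node) and child.name in filter_tags:
            continue  # excluded by tag name
        if isinstance(child, str) and child in EMPTY_STRINGS:
            continue  # assumption: blank strings count as empty children
        kept.append(child)
    return kept


paragraph = Node("p", ["Hello", " ", Comment("note"), Node("script"),
                       Node("em", ["world"])])
children = get_children_sketch(
    paragraph, filter_tags=["script"], filter_types=[Comment]
)
print([getattr(child, "name", child) for child in children])  # ['Hello', 'em']
print(get_children_sketch("just a string"))                   # []
```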