Text Parsers

Base Text Parser

class html2ans.parsers.text.AbstractTextParser

Bases: html2ans.parsers.base.BaseElementParser

Abstract parser for text-only elements (NavigableString, p, etc.).

construct_output(element, *args, **kwargs)

Convenience method for constructing an output dictionary. If element is a Tag with attributes, those attributes will be stashed in additional_properties.

Parameters
  • element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element being parsed

  • ans_type (str) – the ANS type to put in the output type field

  • content (str) – the content to put in the content field

  • version (str) – the version to put in the version field. Note: if not provided but version_required=True on this parser, the output will receive a version from the root parser
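
As a rough illustration of the output shape described above, the following standalone sketch builds a similar dictionary. The function name construct_output_sketch and the plain-dict attrs argument are hypothetical stand-ins for a parsed Tag. This is not the library's implementation, and it does not model the version_required fallback to the root parser.

```python
def construct_output_sketch(attrs, ans_type, content=None, version=None):
    """Build an ANS-like dict (hypothetical helper, not html2ans code)."""
    output = {"type": ans_type}
    if content is not None:
        output["content"] = content
    if version is not None:
        output["version"] = version
    if attrs:
        # Element attributes are kept rather than dropped.
        output["additional_properties"] = dict(attrs)
    return output


result = construct_output_sketch({"class": "lede"}, "text", "Post Reports")
print(result)
```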

Basic Text Parser

class html2ans.parsers.text.ParagraphParser

Bases: html2ans.parsers.text.AbstractTextParser

Paragraph parser. This parser does not remove text-formatting tags (em, b, i, etc.) or inline links. What this parser removes can be adjusted by updating the TEXT_TAGS field, which is inherited from html2ans.parsers.utils.AbstractParserUtilities. See the ANS schema.

Example:

<p>Post Reports is the daily podcast from <a href="https://www.washingtonpost.com">The Washington Post</a></p>

->

{
    "type": "text",
    "content": "Post Reports is the daily podcast from <a href=\"https://www.washingtonpost.com\">The Washington Post</a>"
}

parse(element, *args, **kwargs)

Parses the given element.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
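
Because the content field keeps inline markup, any attribute quotes inside it need escaping when the output is serialized to JSON. A minimal sketch using only the standard json module follows. The dictionary is built by hand here rather than by ParagraphParser, so it shows serialization only, not parsing.

```python
import json

# Inner HTML of the <p> element, with the inline link left in place.
inner_html = (
    "Post Reports is the daily podcast from "
    '<a href="https://www.washingtonpost.com">The Washington Post</a>'
)

ans = {"type": "text", "content": inner_html}

# json.dumps escapes the double quotes around the href value.
serialized = json.dumps(ans)
roundtrip = json.loads(serialized)
print(serialized)
```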

Formatted Text Parser

class html2ans.parsers.text.FormattedTextParser

Bases: html2ans.parsers.text.AbstractTextParser

Formatted text parser. This parser does not remove text-formatting tags (em, b, i, etc.). See the ANS schema.

Example:

<em>Post Reports</em>

->

{
    "type": "text",
    "content": "<em>Post Reports</em>"
}

parse(element, *args, **kwargs)

Parses the given element.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse

Blockquote Parser

class html2ans.parsers.text.BlockquoteParser

Bases: html2ans.parsers.text.AbstractTextParser

Blockquote parser. See the ANS schema.

Example:

<blockquote>
    <p>Post Reports is the daily podcast from The Washington Post.</p>
    <p>Unparalleled reporting.</p>
    <p>Expert insight.</p>
    <p>Clear analysis.</p>
</blockquote>

->

{
    "type": "quote",
    "content_elements": [
        {
            "type": "text",
            "content": "Post Reports is the daily podcast from The Washington Post."
        },
        {
            "type": "text",
            "content": "Unparalleled reporting."
        },
        {
            "type": "text",
            "content": "Expert insight."
        },
        {
            "type": "text",
            "content": "Clear analysis."
        }
    ]
}

parse(element, *args, **kwargs)

Parses the given element.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
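
The blockquote mapping shown in the example above can be approximated with the standard library alone. The sketch below is an illustration under simplifying assumptions: the class BlockquoteSketch and the function blockquote_to_ans are hypothetical names, only p children are handled, and inline tags inside a paragraph are dropped rather than preserved. It is not how BlockquoteParser is implemented.

```python
from html.parser import HTMLParser


class BlockquoteSketch(HTMLParser):
    """Collects the text of each <p> it encounters (illustration only)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Text outside a <p> (e.g. indentation between tags) is ignored.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


def blockquote_to_ans(html):
    parser = BlockquoteSketch()
    parser.feed(html)
    parser.close()
    return {
        "type": "quote",
        "content_elements": [
            {"type": "text", "content": text} for text in parser.paragraphs
        ],
    }


quote = blockquote_to_ans(
    """<blockquote>
    <p>Unparalleled reporting.</p>
    <p>Expert insight.</p>
</blockquote>"""
)
print(quote)
```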

Header Parser

class html2ans.parsers.text.HeaderParser

Bases: html2ans.parsers.base.BaseElementParser

Header parser. See the ANS schema.

Example:

<h1>Post Reports</h1>

->

{
    "type": "header",
    "level": 1,
    "content": "Post Reports"
}

parse(element, *args, **kwargs)

Parses the given element.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
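
In the example above, the level field matches the digit in the tag name. One way to derive it is sketched below. The helper names header_level and header_to_ans are hypothetical, and the sketch takes a tag name and text directly instead of a bs4 element, so it should be read as an assumption about the mapping rather than the library's code.

```python
import re

# Matches h1 through h6 and captures the digit.
HEADER_TAG = re.compile(r"^h([1-6])$")


def header_level(tag_name):
    """Return the numeric level for a header tag name, or None otherwise."""
    match = HEADER_TAG.match(tag_name.lower())
    return int(match.group(1)) if match else None


def header_to_ans(tag_name, text):
    level = header_level(tag_name)
    if level is None:
        raise ValueError("not a header tag: %r" % tag_name)
    return {"type": "header", "level": level, "content": text}


header = header_to_ans("h1", "Post Reports")
print(header)
```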

List Item Parser

class html2ans.parsers.text.ListItemParser

Bases: html2ans.parsers.text.ParagraphParser

Parses a single list item tag into an ANS element of type text, in the same way as paragraph text. As of ANS 0.6.2, a list item must be either text or another list (for a nested list, ListParser is applied recursively). See the ANS schema.

Example:

<li>
    Post Reports,
    <a href="/podcast/">a daily podcast</a>
    from The Washington Post.
</li>

->

{
    "type": "text",
    "content": "Post Reports, <a href=\"/podcast/\">a daily podcast</a> from The Washington Post."
}
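
Note that the line breaks and indentation inside the li element do not appear in the example output. The sketch below reproduces that result by collapsing runs of whitespace with str.split and str.join. This is an assumption about the observable behavior, not a description of how ListItemParser normalizes whitespace internally.

```python
# Inner HTML of the <li>, including the source indentation.
inner_html = """
    Post Reports,
    <a href="/podcast/">a daily podcast</a>
    from The Washington Post.
"""

# Collapse every run of whitespace (including newlines) to a single space.
content = " ".join(inner_html.split())

item = {"type": "text", "content": content}
print(item["content"])
```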

List Parser

class html2ans.parsers.text.ListParser(list_item_parser=None)

Bases: html2ans.parsers.base.BaseElementParser

List parser. See the ANS schema.

Example:

<ul>
    <li>
        Post Reports,
        <a href="/podcast/">a daily podcast</a>
        from The Washington Post.
    </li>
    <li>
        <ol>
            <li>Unparalleled reporting.</li>
            <li>Expert insight.</li>
        </ol>
    </li>
    <li><p>Clear analysis.</p></li>
</ul>

->

{
    'type': 'list',
    'list_type': 'unordered',
    'items': [
        {
            'type': 'text',
            'content': 'Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post.'
        },
        {
            'type': 'list',
            'list_type': 'ordered',
            'items': [
                {
                    'type': 'text',
                    'content': 'Unparalleled reporting.'
                },
                {
                    'type': 'text',
                    'content': 'Expert insight.'
                }
            ]
        },
        {
            'type': 'text',
            'content': '<p>Clear analysis.</p>'
        }
    ]
}

Parameters

list_item_parser (ElementParser) – the parser to use on individual list elements (defaults to ListItemParser)

parse(element, *args, **kwargs)

Parses the given element.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse
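
The recursive structure of the list output, and the idea of a pluggable item parser, can be sketched without the library. In the example below, nested lists are modeled as (tag_name, items) tuples and text items as strings. The names list_to_ans, default_item_parser and LIST_TYPES are hypothetical, and the real ListParser operates on bs4 elements rather than this simplified input.

```python
LIST_TYPES = {"ul": "unordered", "ol": "ordered"}


def default_item_parser(text):
    """Stand-in for a list item parser: wrap text in a text element."""
    return {"type": "text", "content": text}


def list_to_ans(tag_name, items, item_parser=default_item_parser):
    """Convert a simplified list model into an ANS-like list dict.

    items holds strings (text items) or (tag_name, items) tuples
    (nested lists), which are converted recursively.
    """
    parsed = []
    for item in items:
        if isinstance(item, tuple):
            nested_tag, nested_items = item
            parsed.append(list_to_ans(nested_tag, nested_items, item_parser))
        else:
            parsed.append(item_parser(item))
    return {"type": "list", "list_type": LIST_TYPES[tag_name], "items": parsed}


ans_list = list_to_ans(
    "ul",
    ["Post Reports.", ("ol", ["Unparalleled reporting.", "Expert insight."])],
)

# Swapping in a different item parser changes how each text item is built.
shouted = list_to_ans(
    "ol", ["clear analysis."],
    item_parser=lambda text: {"type": "text", "content": text.upper()},
)
print(ans_list)
```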