Text Parsers

Base Text Parser

class html2ans.parsers.text.AbstractTextParser

Bases: html2ans.parsers.base.BaseElementParser

Abstract parser for text-only elements (NavigableString, p, etc.).

construct_output(element, *args, **kwargs)

Convenience method for constructing an output dictionary. If element is a Tag with attributes, those attributes will be stashed in additional_properties.

Parameters

element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element being parsed
ans_type (str) – the ANS type to put in the output type field
content (str) – the content to put in the content field
version (str) – the version to put in the version field. Note: if not provided but version_required=True on this parser, the output will receive a version from the root parser
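
As a rough illustration of the behaviour described above, the following standalone sketch builds an output dictionary and stashes tag attributes in additional_properties. It does not import html2ans: FakeTag and construct_output_sketch are hypothetical names, and the exact key layout (in particular the shape of additional_properties) is an assumption rather than the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class FakeTag:
    """Hypothetical stand-in for a bs4 Tag: a name plus an attribute dict."""
    name: str
    attrs: Dict[str, Any] = field(default_factory=dict)


def construct_output_sketch(element: Any, ans_type: str,
                            content: Optional[str] = None,
                            version: Optional[str] = None) -> Dict[str, Any]:
    """Build an ANS-like dict; the key layout here is an assumption."""
    output: Dict[str, Any] = {"type": ans_type}
    if content is not None:
        output["content"] = content
    if version is not None:
        output["version"] = version
    # Only tag-like elements carry attributes; plain strings do not.
    attrs = getattr(element, "attrs", None)
    if attrs:
        output["additional_properties"] = dict(attrs)
    return output


paragraph = FakeTag("p", {"class": "lede"})
result = construct_output_sketch(paragraph, "text", "Post Reports")
plain = construct_output_sketch("Post Reports", "text", "Post Reports")
```

In this sketch, result carries the class attribute under additional_properties, while plain (built from a bare string) has no additional_properties key at all.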

Basic Text Parser

class html2ans.parsers.text.ParagraphParser

Bases: html2ans.parsers.text.AbstractTextParser

Paragraph parser. This parser preserves both text-formatting tags (em, b, i, etc.) and inline links. To change what is or isn't removed, update the TEXT_TAGS field, which is inherited from html2ans.parsers.utils.AbstractParserUtilities. See the ANS schema.

Example:

<p>Post Reports is the daily podcast from <a href="https://www.washingtonpost.com">The Washington Post</a></p>

->

{
    "type": "text",
    "content": "Post Reports is the daily podcast from <a href=\"https://www.washingtonpost.com\">The Washington Post</a>"
}
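
Because the content field keeps the inline link, it contains double quotes that must be escaped when the element is written out as JSON. A minimal sketch using only the standard library's json module (the ans_element dictionary is typed by hand here, not produced by html2ans):

```python
import json

# Hand-written copy of the element above; not generated by the parser.
ans_element = {
    "type": "text",
    "content": (
        'Post Reports is the daily podcast from '
        '<a href="https://www.washingtonpost.com">The Washington Post</a>'
    ),
}

# json.dumps escapes the quotes around the href value.
serialized = json.dumps(ans_element, indent=4)
print(serialized)

# Loading the string back yields the original dictionary.
round_tripped = json.loads(serialized)
```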

parse(element, *args, **kwargs)

Parses the given element.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse

Formatted Text Parser

class html2ans.parsers.text.FormattedTextParser

Bases: html2ans.parsers.text.AbstractTextParser

Formatted text parser. This parser preserves text-formatting tags such as em, b, and i. See the ANS schema.

Example:

<em>Post Reports</em>

->

{
    "type": "text",
    "content": "<em>Post Reports</em>"
}

parse(element, *args, **kwargs)

Parses the given element.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse

Blockquote Parser

class html2ans.parsers.text.BlockquoteParser

Bases: html2ans.parsers.text.AbstractTextParser

Blockquote parser. See the ANS schema.

Example:

<blockquote>
    <p>Post Reports is the daily podcast from The Washington Post.</p>
    <p>Unparalleled reporting.</p>
    <p>Expert insight.</p>
    <p>Clear analysis.</p>
</blockquote>

->

{
    "type": "quote",
    "content_elements": [
        {
            "type": "text",
            "content": "Post Reports is the daily podcast from The Washington Post."
        },
        {
            "type": "text",
            "content": "Unparalleled reporting."
        },
        {
            "type": "text",
            "content": "Expert insight."
        },
        {
            "type": "text",
            "content": "Clear analysis."
        }
    ]
}
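
The mapping shown above (one text element per paragraph, wrapped in a quote element) can be approximated with the standard library alone. This is only a sketch under stated assumptions: BlockquoteSketch and blockquote_to_ans are hypothetical names, the code does not use html2ans or BeautifulSoup, and it ignores anything in the blockquote that is not inside a p tag.

```python
from html.parser import HTMLParser


class BlockquoteSketch(HTMLParser):
    """Collects the text of each <p> found inside a <blockquote>."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.in_paragraph = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.in_quote = True
        elif tag == "p" and self.in_quote:
            # Start a new text element for this paragraph.
            self.in_paragraph = True
            self.items.append({"type": "text", "content": ""})

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self.in_quote = False
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.items[-1]["content"] += data


def blockquote_to_ans(html):
    """Return a quote element whose content_elements are the paragraphs."""
    parser = BlockquoteSketch()
    parser.feed(html)
    parser.close()
    return {"type": "quote", "content_elements": parser.items}


quote = blockquote_to_ans(
    "<blockquote><p>Unparalleled reporting.</p><p>Expert insight.</p></blockquote>"
)
```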

parse(element, *args, **kwargs)

Parses the given element.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse

Header Parser

class html2ans.parsers.text.HeaderParser

Bases: html2ans.parsers.base.BaseElementParser

Header parser. See the ANS schema.

Example:

<h1>Post Reports</h1>

->

{
    "type": "header",
    "level": 1,
    "content": "Post Reports"
}
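
In the example above, the level field matches the digit in the tag name. A minimal sketch of that mapping, assuming tag names h1 through h6 (header_to_ans is a hypothetical helper, not part of html2ans):

```python
import re

# Matches h1..h6 and captures the digit.
HEADER_TAG = re.compile(r"^h([1-6])$")


def header_to_ans(tag_name, text):
    """Map a header tag name and its text to an ANS-like header element."""
    match = HEADER_TAG.match(tag_name.lower())
    if match is None:
        raise ValueError("not a header tag: %r" % tag_name)
    return {
        "type": "header",
        "level": int(match.group(1)),
        "content": text.strip(),
    }


header = header_to_ans("h1", "Post Reports")
```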

parse(element, *args, **kwargs)

Parses the given element.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse

Interstitial Link Parser

class html2ans.parsers.text.InterstitialLinkParser

Bases: html2ans.parsers.base.BaseElementParser

Converts links in anchor elements into ANS elements of type interstitial_link. See the ANS schema.

Example:

<a href="https://www.washingtonpost.com">The Washington Post</a>

->

{
    "type": "interstitial_link",
    "url": "https://www.washingtonpost.com",
    "content": "The Washington Post"
}
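
The conversion above takes the href attribute as url and the anchor text as content. The following standard-library sketch reproduces that shape for the first anchor in a fragment; AnchorSketch and anchor_to_ans are hypothetical names, and how the real parser decides which anchors qualify is not covered here.

```python
from html.parser import HTMLParser


class AnchorSketch(HTMLParser):
    """Records the href and text of the first <a href=...> it sees."""

    def __init__(self):
        super().__init__()
        self.url = None
        self.in_anchor = False
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.url is None:
            self.url = dict(attrs).get("href")
            # Only collect text if the anchor actually has an href.
            self.in_anchor = self.url is not None

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.text_parts.append(data)


def anchor_to_ans(html):
    """Return an interstitial_link element, or None if no link is found."""
    parser = AnchorSketch()
    parser.feed(html)
    parser.close()
    if parser.url is None:
        return None
    return {
        "type": "interstitial_link",
        "url": parser.url,
        "content": "".join(parser.text_parts).strip(),
    }


link = anchor_to_ans(
    '<a href="https://www.washingtonpost.com">The Washington Post</a>'
)
```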

parse(element, *args, **kwargs)

Parses the given element.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse

List Item Parser

class html2ans.parsers.text.ListItemParser

Bases: html2ans.parsers.text.ParagraphParser

Parses a single list item tag into an ANS element of type text, treating its contents as paragraph text. As of ANS 0.6.2, a list item must be either text or another list; for a nested list, the ListParser is used recursively. See the ANS schema.

Example:

<li>
    Post Reports,
    <a href="/podcast/">a daily podcast</a>
    from The Washington Post.
</li>

->

{
    "type": "text",
    "content": "Post Reports, <a href=\"/podcast/\">a daily podcast</a> from The Washington Post."
}
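
Note that the line breaks and indentation inside the li tag do not appear in the output. One way to get the same result is to collapse every run of whitespace into a single space, as sketched below; whether html2ans normalizes whitespace in exactly this way is an assumption.

```python
# Inner HTML of the <li> above, including its line breaks and indentation.
inner_html = """
    Post Reports,
    <a href="/podcast/">a daily podcast</a>
    from The Washington Post.
"""

# str.split() with no arguments splits on any whitespace run and drops
# leading/trailing whitespace; joining with a single space collapses it.
content = " ".join(inner_html.split())
list_item = {"type": "text", "content": content}
```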

List Parser

class html2ans.parsers.text.ListParser(list_item_parser=None)

Bases: html2ans.parsers.base.BaseElementParser

List parser. See the ANS schema.

Example:

<ul>
    <li>
        Post Reports,
        <a href="/podcast/">a daily podcast</a>
        from The Washington Post.
    </li>
    <li>
        <ol>
            <li>Unparalleled reporting.</li>
            <li>Expert insight.</li>
        </ol>
    </li>
    <li><p>Clear analysis.</p></li>
</ul>

->

{
    'type': 'list',
    'list_type': 'unordered',
    'items': [
        {
            'type': 'text',
            'content': 'Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post.'
        },
        {
            'type': 'list',
            'list_type': 'ordered',
            'items': [
                {
                    'type': 'text',
                    'content': 'Unparalleled reporting.'
                },
                {
                    'type': 'text',
                    'content': 'Expert insight.'
                }
            ]
        },
        {
            'type': 'text',
            'content': '<p>Clear analysis.</p>'
        }
    ]
}

Parameters: list_item_parser (ElementParser) – the parser to use on individual list elements (defaults to ListItemParser)
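
The recursion over nested lists, and the idea of a pluggable per-item parser, can be sketched with the standard library as follows. This is not the html2ans implementation: list_to_ans and text_item are hypothetical names, the input must be well-formed markup that xml.etree can parse, and unlike the real ListItemParser this sketch keeps only the text of each item and drops inline tags.

```python
import xml.etree.ElementTree as ET

LIST_TYPES = {"ul": "unordered", "ol": "ordered"}


def text_item(li):
    """Default item parser: whitespace-normalized text of the <li>."""
    text = " ".join("".join(li.itertext()).split())
    return {"type": "text", "content": text}


def list_to_ans(node, item_parser=text_item):
    """Convert a <ul>/<ol> element into a nested ANS-like list element."""
    items = []
    for li in node.findall("li"):
        # A list item is either another list (recurse) or text.
        nested = next((child for child in li if child.tag in LIST_TYPES), None)
        if nested is not None:
            items.append(list_to_ans(nested, item_parser))
        else:
            items.append(item_parser(li))
    return {"type": "list", "list_type": LIST_TYPES[node.tag], "items": items}


root = ET.fromstring(
    "<ul>"
    "<li>Clear analysis.</li>"
    "<li><ol><li>Unparalleled reporting.</li><li>Expert insight.</li></ol></li>"
    "</ul>"
)
ans_list = list_to_ans(root)
```

Passing a different callable as item_parser changes how each non-list item is converted, which mirrors the role of the list_item_parser argument described above.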

parse(element, *args, **kwargs)

Parses the given element.

Parameters: element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element to parse