Text Parsers
Base Text Parser
class html2ans.parsers.text.AbstractTextParser
Bases: html2ans.parsers.base.BaseElementParser
Abstract parser for text-only elements (NavigableString, p, etc.).
construct_output(element, *args, **kwargs)
Convenience method for constructing an output dictionary. If element is a Tag with attributes, those attributes will be stashed in additional_properties.
Parameters
- element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element being parsed
- ans_type (str) – the ANS type to put in the output type field
- content (str) – the content to put in the content field
- version (str) – the version to put in the version field. Note: if not provided but version_required=True on this parser, the output will receive a version from the root parser
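The shape of the dictionary described for construct_output can be illustrated with a minimal sketch. It does not call html2ans: build_output and its attrs parameter are hypothetical names, attributes are passed as a plain dict rather than read from a bs4 Tag, and the exact layout the library uses inside additional_properties is an assumption.

```python
def build_output(ans_type, content, version=None, attrs=None):
    """Hypothetical stand-in for construct_output (not the html2ans code)."""
    output = {"type": ans_type, "content": content}
    if version is not None:
        # Only emit a version when one was supplied.
        output["version"] = version
    if attrs:
        # Assumption: element attributes are copied unchanged.
        output["additional_properties"] = dict(attrs)
    return output


result = build_output("text", "Post Reports", attrs={"class": "intro"})
```

Here result holds the type and content fields plus an additional_properties entry carrying the class attribute; with no attributes, the additional_properties key is simply omitted in this sketch.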
Basic Text Parser
class html2ans.parsers.text.ParagraphParser
Bases: html2ans.parsers.text.AbstractTextParser
Paragraph parser. This parser does not remove text-formatting tags (em, b, i, etc.) or inline links. What this parser removes can be adjusted by updating the TEXT_TAGS field, which is inherited from html2ans.parsers.utils.AbstractParserUtilities.
ANS schema
Example:
<p>Post Reports is the daily podcast from <a href="https://www.washingtonpost.com">The Washington Post</a></p>
->
{ "type": "text", "content": "Post Reports is the daily podcast from <a href=\"https://www.washingtonpost.com\">The Washington Post</a>" }
Formatted Text Parser
class html2ans.parsers.text.FormattedTextParser
Bases: html2ans.parsers.text.AbstractTextParser
Formatted text parser. This parser does not remove text-formatting tags like em, b, i, etc.
ANS schema
Example:
<em>Post Reports</em>
->
{ "type": "text", "content": "<em>Post Reports</em>" }
Blockquote Parser
class html2ans.parsers.text.BlockquoteParser
Bases: html2ans.parsers.text.AbstractTextParser
Blockquote parser.
ANS schema
Example:
<blockquote>
  <p>Post Reports is the daily podcast from The Washington Post.</p>
  <p>Unparalleled reporting.</p>
  <p>Expert insight.</p>
  <p>Clear analysis.</p>
</blockquote>
->
{
  "type": "quote",
  "content_elements": [
    { "type": "text", "content": "Post Reports is the daily podcast from The Washington Post." },
    { "type": "text", "content": "Unparalleled reporting." },
    { "type": "text", "content": "Expert insight." },
    { "type": "text", "content": "Clear analysis." }
  ]
}
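The mapping in the blockquote example, one text element per paragraph, can be sketched with the standard library alone. This is not the BlockquoteParser implementation: blockquote_to_ans is a hypothetical helper, it needs well-formed markup because it uses xml.etree.ElementTree, and it flattens any inline tags to plain text, which the library may not do.

```python
import xml.etree.ElementTree as ET


def blockquote_to_ans(markup):
    """Hypothetical helper: one ANS text element per direct <p> child."""
    root = ET.fromstring(markup)
    elements = [
        # itertext() drops inline tags and keeps only their text.
        {"type": "text", "content": "".join(p.itertext()).strip()}
        for p in root.findall("p")
    ]
    return {"type": "quote", "content_elements": elements}


quote = blockquote_to_ans(
    "<blockquote>"
    "<p>Post Reports is the daily podcast from The Washington Post.</p>"
    "<p>Unparalleled reporting.</p>"
    "</blockquote>"
)
```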
Header Parser
class html2ans.parsers.text.HeaderParser
Bases: html2ans.parsers.base.BaseElementParser
Header parser.
ANS schema
Example:
<h1>Post Reports</h1>
->
{ "type": "header", "level": 1, "content": "Post Reports" }
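In the example, the level field matches the digit in the tag name. A minimal sketch of that mapping follows; header_to_ans is a hypothetical function, and deriving the level from the tag name for h1 through h6 is inferred from the example rather than taken from the HeaderParser source.

```python
import re

# Matches h1 through h6 and captures the digit.
HEADER_TAG = re.compile(r"^h([1-6])$", re.IGNORECASE)


def header_to_ans(tag_name, text):
    """Hypothetical helper: build a header element from a tag name and its text."""
    match = HEADER_TAG.match(tag_name)
    if match is None:
        raise ValueError(f"not a header tag: {tag_name!r}")
    return {"type": "header", "level": int(match.group(1)), "content": text}


header = header_to_ans("h1", "Post Reports")
```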
Interstitial Link Parser
class html2ans.parsers.text.InterstitialLinkParser
Bases: html2ans.parsers.base.BaseElementParser
Converts links in anchor elements into ANS elements of type interstitial_link.
ANS schema
Example:
<a href="https://www.washingtonpost.com">The Washington Post</a>
->
{ "type": "interstitial_link", "url": "https://www.washingtonpost.com", "content": "The Washington Post" }
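The anchor-to-interstitial_link conversion can be approximated with the standard library's html.parser. This is a sketch under stated assumptions, not the InterstitialLinkParser code: AnchorCollector and anchor_to_interstitial_link are hypothetical names, and only the first anchor that carries an href is considered.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: records the href and text of the first linked anchor."""

    def __init__(self):
        super().__init__()
        self.url = None
        self.text_parts = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and self.url is None and href is not None:
            self.url = href
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.text_parts.append(data)


def anchor_to_interstitial_link(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    if collector.url is None:
        raise ValueError("no anchor with an href found")
    return {
        "type": "interstitial_link",
        "url": collector.url,
        "content": "".join(collector.text_parts).strip(),
    }


link = anchor_to_interstitial_link(
    '<a href="https://www.washingtonpost.com">The Washington Post</a>'
)
```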
List Item Parser
class html2ans.parsers.text.ListItemParser
Bases: html2ans.parsers.text.ParagraphParser
Parses a single list item tag as paragraph text, producing an ANS element of type text. As of ANS 0.6.2, list items have to be either text or another list (in the case of another list, the ListParser is used recursively).
ANS schema
Example:
<li> Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post. </li>
->
{ "type": "text", "content": "Post Reports, <a href=\"/podcast/\">a daily podcast</a> from The Washington Post." }
List Parser
class html2ans.parsers.text.ListParser(list_item_parser=None)
Bases: html2ans.parsers.base.BaseElementParser
List parser.
ANS schema
Example:
<ul>
  <li> Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post. </li>
  <li>
    <ol>
      <li>Unparalleled reporting.</li>
      <li>Expert insight.</li>
    </ol>
  </li>
  <li><p>Clear analysis.</p></li>
</ul>
->
{
  'type': 'list',
  'list_type': 'unordered',
  'items': [
    { 'type': 'text', 'content': 'Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post.' },
    {
      'type': 'list',
      'list_type': 'ordered',
      'items': [
        { 'type': 'text', 'content': 'Unparalleled reporting.' },
        { 'type': 'text', 'content': 'Expert insight.' }
      ]
    },
    { 'type': 'text', 'content': '<p>Clear analysis.</p>' }
  ]
}
Parameters
- list_item_parser (ElementParser) – the parser to use on individual list elements (defaults to ListItemParser)
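The recursive handling of nested lists described above can be sketched with xml.etree.ElementTree. This is an illustration, not the ListParser implementation: list_to_ans, inner_markup, and LIST_TYPES are hypothetical names, the markup must be well formed, and an item is treated as a nested list whenever it has a ul or ol child.

```python
import xml.etree.ElementTree as ET

LIST_TYPES = {"ul": "unordered", "ol": "ordered"}


def inner_markup(element):
    """Return the text and child markup inside an element, without its own tag."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also serializes the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()


def list_to_ans(element):
    """Hypothetical helper: convert a ul/ol element into an ANS-style list dict."""
    items = []
    for li in element.findall("li"):
        nested = next((child for child in li if child.tag in LIST_TYPES), None)
        if nested is not None:
            items.append(list_to_ans(nested))  # recurse into the nested list
        else:
            items.append({"type": "text", "content": inner_markup(li)})
    return {"type": "list", "list_type": LIST_TYPES[element.tag], "items": items}


markup = (
    "<ul>"
    '<li>Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post.</li>'
    "<li><ol><li>Unparalleled reporting.</li><li>Expert insight.</li></ol></li>"
    "<li><p>Clear analysis.</p></li>"
    "</ul>"
)
ans_list = list_to_ans(ET.fromstring(markup))
```

In this sketch the per-item step is written inline; in the library that role is played by the list_item_parser argument, which defaults to ListItemParser.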