Text Parsers
Base Text Parser
class html2ans.parsers.text.AbstractTextParser
Bases: html2ans.parsers.base.BaseElementParser
Abstract parser for text-only elements (NavigableString, p, etc.).
construct_output(element, *args, **kwargs)
Convenience method for constructing an output dictionary. If element is a Tag with attributes, those attributes will be stashed in additional_properties.
Parameters:
    element (bs4.element.Tag or bs4.element.Comment or bs4.element.NavigableString) – the element being parsed
    ans_type (str) – the ANS type to put in the output type field
    content (str) – the content to put in the content field
    version (str) – the version to put in the version field. Note: if not provided but version_required=True on this parser, the output will receive a version from the root parser
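The shape of the dictionary described above can be illustrated with a minimal standalone sketch. This is not the library's implementation: the function name construct_output_sketch is hypothetical, a plain dict stands in for the attributes of a Tag, and the handling of omitted fields is an assumption.

```python
def construct_output_sketch(attrs, ans_type, content=None, version=None):
    """Build an ANS-style dict; `attrs` stands in for a Tag's attributes."""
    output = {"type": ans_type}
    if content is not None:
        output["content"] = content
    if version is not None:
        output["version"] = version
    if attrs:
        # Mirrors the documented behaviour: element attributes are
        # stashed under additional_properties (exact shape is assumed).
        output["additional_properties"] = dict(attrs)
    return output


result = construct_output_sketch({"class": "lede"}, "text", "Post Reports")
print(result)
```

In the real parser, a missing version may instead be filled in from the root parser when version_required=True; the sketch leaves it out.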
Basic Text Parser
class html2ans.parsers.text.ParagraphParser
Bases: html2ans.parsers.text.AbstractTextParser
Paragraph parser. This parser does not remove text-formatting tags like em, b, i, etc., or inline links. What is or isn’t removed by this parser can be adjusted by updating the TEXT_TAGS field, which is inherited from html2ans.parsers.utils.AbstractParserUtilities.
ANS schema
Example:
<p>Post Reports is the daily podcast from <a href="https://www.washingtonpost.com">The Washington Post</a></p>
->
{ "type": "text", "content": "Post Reports is the daily podcast from <a href=\"https://www.washingtonpost.com\">The Washington Post</a>" }
Formatted Text Parser
class html2ans.parsers.text.FormattedTextParser
Bases: html2ans.parsers.text.AbstractTextParser
Formatted text parser. This parser does not remove text-formatting tags like em, b, i, etc.
ANS schema
Example:
<em>Post Reports</em>
->
{ "type": "text", "content": "<em>Post Reports</em>" }
Blockquote Parser
class html2ans.parsers.text.BlockquoteParser
Bases: html2ans.parsers.text.AbstractTextParser
Blockquote parser. ANS schema
Example:
<blockquote> <p>Post Reports is the daily podcast from The Washington Post.</p> <p>Unparalleled reporting.</p> <p>Expert insight.</p> <p>Clear analysis.</p> </blockquote>
->
{ "type": "quote", "content_elements": [ { "type": "text", "content": "Post Reports is the daily podcast from The Washington Post." }, { "type": "text", "content": "Unparalleled reporting." }, { "type": "text", "content": "Expert insight." }, { "type": "text", "content": "Clear analysis." } ] }
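The mapping in this example, one text element per paragraph inside the blockquote, can be sketched with the standard library alone. This is only an illustration, not how BlockquoteParser works internally: blockquote_sketch is a hypothetical name, the input is assumed to be well-formed markup, and inline tags are flattened to plain text for brevity.

```python
import xml.etree.ElementTree as ET


def blockquote_sketch(markup):
    """Turn each direct <p> child of a <blockquote> into a text element."""
    root = ET.fromstring(markup)
    elements = [
        # itertext() drops any inline tags and keeps only the text.
        {"type": "text", "content": "".join(p.itertext()).strip()}
        for p in root.findall("p")
    ]
    return {"type": "quote", "content_elements": elements}


result = blockquote_sketch(
    "<blockquote> <p>Post Reports is the daily podcast from "
    "The Washington Post.</p> <p>Unparalleled reporting.</p> </blockquote>"
)
print(result)
```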
Header Parser
class html2ans.parsers.text.HeaderParser
Bases: html2ans.parsers.base.BaseElementParser
Header parser. ANS schema
Example:
<h1>Post Reports</h1>
->
{ "type": "header", "level": 1, "content": "Post Reports" }
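In this example the level field matches the digit in the tag name. A minimal sketch of that mapping follows. It assumes the level is derived from the tag name h1 through h6; header_to_ans_sketch is a hypothetical helper, not part of html2ans.

```python
import re

# Matches h1 through h6 and captures the digit.
HEADER_TAG = re.compile(r"^h([1-6])$")


def header_to_ans_sketch(tag_name, text):
    """Map a header tag name and its text to an ANS-style header dict."""
    match = HEADER_TAG.match(tag_name.lower())
    if match is None:
        raise ValueError("not a header tag: %r" % tag_name)
    return {"type": "header", "level": int(match.group(1)), "content": text}


result = header_to_ans_sketch("h1", "Post Reports")
print(result)
```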
Interstitial Link Parser
class html2ans.parsers.text.InterstitialLinkParser
Bases: html2ans.parsers.base.BaseElementParser
Converts links in anchor elements into ANS elements of type interstitial_link.
ANS schema
Example:
<a href="https://www.washingtonpost.com">The Washington Post</a>
->
{ "type": "interstitial_link", "url": "https://www.washingtonpost.com", "content": "The Washington Post" }
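The conversion shown above takes the url from the anchor's href and the content from its text. A rough standard-library sketch of that extraction is given here. The names AnchorCollector and interstitial_link_sketch are hypothetical, and the real parser works on Beautiful Soup elements rather than raw strings.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href and text of the first <a> element it sees."""

    def __init__(self):
        super().__init__()
        self.url = None
        self._in_anchor = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.url is None:
            self.url = dict(attrs).get("href")
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)


def interstitial_link_sketch(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return {
        "type": "interstitial_link",
        "url": collector.url,
        "content": "".join(collector._text).strip(),
    }


link = interstitial_link_sketch(
    '<a href="https://www.washingtonpost.com">The Washington Post</a>'
)
print(link)
```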
List Item Parser
class html2ans.parsers.text.ListItemParser
Bases: html2ans.parsers.text.ParagraphParser
Parses a single list item tag as a paragraph-text ANS element of type text. As of ANS 0.6.2, list items have to be either text or another list (in the case of another list, the ListParser is used recursively).
ANS schema
Example:
<li> Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post. </li>
->
{ "type": "text", "content": "Post Reports, <a href=\"/podcast/\">a daily podcast</a> from The Washington Post." }
List Parser
class html2ans.parsers.text.ListParser(list_item_parser=None)
Bases: html2ans.parsers.base.BaseElementParser
List parser. ANS schema
Example:
<ul> <li> Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post. </li> <li> <ol> <li>Unparalleled reporting.</li> <li>Expert insight.</li> </ol> </li> <li><p>Clear analysis.</p></li> </ul>
->
{ 'type': 'list', 'list_type': 'unordered', 'items': [ { 'type': 'text', 'content': 'Post Reports, <a href="/podcast/">a daily podcast</a> from The Washington Post.' }, { 'type': 'list', 'list_type': 'ordered', 'items': [ { 'type': 'text', 'content': 'Unparalleled reporting.' }, { 'type': 'text', 'content': 'Expert insight.' } ] }, { 'type': 'text', 'content': '<p>Clear analysis.</p>' } ] }
Parameters:
    list_item_parser (ElementParser) – the parser to use on individual list elements (defaults to ListItemParser)
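The recursion and the pluggable item parser can be sketched with the standard library. This is an illustration under assumptions, not the library's code: parse_list_sketch and default_item_parser are hypothetical names, a plain callable stands in for the list_item_parser argument, and the input must be well-formed markup because xml.etree is used instead of Beautiful Soup.

```python
import xml.etree.ElementTree as ET

LIST_TYPES = {"ul": "unordered", "ol": "ordered"}


def default_item_parser(li):
    """Serialize a list item's inner markup as a text element."""
    # ET.tostring includes each child's tail text, so joining the
    # children after li.text reproduces the item's inner markup.
    inner = (li.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in li
    )
    return {"type": "text", "content": inner.strip()}


def parse_list_sketch(list_el, item_parser=None):
    """Convert a <ul>/<ol> element into a nested ANS-style list dict."""
    item_parser = item_parser or default_item_parser
    items = []
    for li in list_el.findall("li"):
        children = list(li)
        only_nested_list = (
            len(children) == 1
            and children[0].tag in LIST_TYPES
            and not (li.text or "").strip()
            and not (children[0].tail or "").strip()
        )
        if only_nested_list:
            # An item that holds only another list is parsed recursively.
            items.append(parse_list_sketch(children[0], item_parser))
        else:
            items.append(item_parser(li))
    return {"type": "list", "list_type": LIST_TYPES[list_el.tag], "items": items}


markup = (
    "<ul>"
    '<li>Post Reports, <a href="/podcast/">a daily podcast</a></li>'
    "<li><ol><li>Unparalleled reporting.</li><li>Expert insight.</li></ol></li>"
    "<li><p>Clear analysis.</p></li>"
    "</ul>"
)
result = parse_list_sketch(ET.fromstring(markup))
print(result)
```

Passing a different callable as item_parser changes how every non-list item is converted, including items in nested lists, which is the role the list_item_parser parameter is documented to play.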