Document Parsing
Interface
Base HTML Parser
- class html2ans.base.BaseHtmlAnsParser(ans_version=None, soup_parse_lib='lxml', suppress_exceptions=False, default_parsers=None, *args, **kwargs)
Bases: html2ans.base.AbstractHtmlAnsParser, html2ans.parsers.utils.AbstractParserUtilities
The base root/top-level parser class; assumes elements will be parsed with BeautifulSoup. Use this class to generate a list of ANS elements from an HTML document. HTML elements within the document are parsed using the element parsers present in the parsers attribute. The parsers attribute is populated first with the parsers from the DEFAULT_PARSERS variable. Other parsers can be added using this class’s add_parser and insert_parser methods.
Attempts at parsing an element use each parser in that element’s parser list, in order. If a parser isn’t applicable (is_applicable returns False) or parse indicates the element wasn’t a match, the next parser is tried. Why are there two ways of indicating whether an element and a parser are compatible? You can’t always tell whether they are compatible until you’re parsing the element. is_applicable catches early, obvious issues (like “this parser is for img elements; is this an img?”).
Setting an ans_version won’t affect the output of these parsers in terms of actual ANS version compatibility; it is provided as a convenience (sometimes you need to update the overall ANS version because of a specific new feature, but don’t want to update the output of all parsers).
Use the suppress_exceptions option to treat element parsing exceptions as non-matches. With suppress_exceptions, when an exception is thrown, the next parser is tried (as though is_applicable returned False).
- Parameters
ans_version (str) – the ANS version to apply to the output of parsers that require a version
soup_parse_lib (str) – the BeautifulSoup parsing library to use
suppress_exceptions (bool) – whether or not to suppress exceptions during element parsing
default_parsers (list) – the default parsers to populate parsers with. Order matters here!
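The two-stage matching described above (a cheap is_applicable check, then a parse call that can still report a non-match) can be sketched as follows. This is a simplified stand-alone model, not the html2ans implementation: elements are plain dicts rather than BeautifulSoup tags, and Result, ImageSketchParser, BrokenSketchParser, and parse_with_chain are hypothetical names introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical result type: `output` is the parsed dict, `match` says
# whether the parser actually handled the element.
Result = namedtuple("Result", ["output", "match"])


class ImageSketchParser:
    """Illustrative parser for img elements (not an html2ans class)."""

    applicable_elements = ["img"]

    def is_applicable(self, element):
        # Early, obvious check: is this the right kind of tag at all?
        return element["tag"] in self.applicable_elements

    def parse(self, element):
        # A check that is only possible while parsing: an img with no
        # src is reported as a non-match rather than an error.
        src = element.get("src")
        if not src:
            return Result(None, False)
        return Result({"type": "image", "url": src}, True)


class BrokenSketchParser:
    """Illustrative parser that always raises during parsing."""

    applicable_elements = ["img"]

    def is_applicable(self, element):
        return True

    def parse(self, element):
        raise ValueError("cannot parse this element")


def parse_with_chain(element, parsers, suppress_exceptions=False):
    """Try each parser in order and return the first matching output."""
    for parser in parsers:
        if not parser.is_applicable(element):
            continue
        try:
            result = parser.parse(element)
        except Exception:
            if suppress_exceptions:
                # Treated like a non-match: move on to the next parser.
                continue
            raise
        if result.match:
            return result.output
    return None
```

With suppress_exceptions=True, a chain of [BrokenSketchParser(), ImageSketchParser()] falls through the failing parser and returns the image output; without it, the ValueError propagates to the caller.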
- BACKUP_PARSERS = []
The backup parsers for this class to use. This list will be used on every parsing attempt if the primary parsers for a given type don’t match. For example, if parsers contains 'p': [ParagraphParser()] and an exception is thrown when processing a paragraph tag using the ParagraphParser, all backup parsers will also be tried. In the default implementation, this essentially means that producing raw_html will be the last resort when all other parsers fail.
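A minimal sketch of the backup behavior described above, assuming that a failing primary parser is simply skipped and that the backup list is appended to whatever primary parsers exist for the tag. The function names and the dict-based elements are hypothetical; the real library works on BeautifulSoup elements and parser objects.

```python
def paragraph_parser(element):
    """Illustrative primary parser: fails on empty paragraphs."""
    text = element["text"]
    if not text:
        raise ValueError("empty paragraph")
    return {"type": "text", "content": text}


def raw_html_parser(element):
    """Illustrative last-resort parser: keeps the markup as-is."""
    return {"type": "raw_html", "content": element["html"]}


def parse_element(element, parsers, backup_parsers):
    """Try the primary parsers for the tag, then the backup parsers."""
    primary = parsers.get(element["tag"], [])
    for parser in primary + backup_parsers:
        try:
            output = parser(element)
        except Exception:
            # Assumption for this sketch: an exception counts as a non-match.
            continue
        if output is not None:
            return output
    return None


parsers = {"p": [paragraph_parser]}
backup_parsers = [raw_html_parser]

good = parse_element({"tag": "p", "text": "hi", "html": "<p>hi</p>"}, parsers, backup_parsers)
fallback = parse_element({"tag": "p", "text": "", "html": "<p></p>"}, parsers, backup_parsers)
```

Here good comes from the paragraph parser, while fallback is produced by the raw HTML parser because the primary parser raised.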
- parsers = None
A mapping of potential HTML elements to a list of parsers to use to attempt to parse elements of that type.
- generate_ans(html, start_tag='body', *args, **kwargs)
Parses html and produces ANS in a jsonify-able format.
- Parameters
html (str) – the html to parse
start_tag (str) – where to start parsing (if not provided, all tags will be parsed)
- Returns
a list of ANS elements as dictionaries
- insert_parser(element_key, parser, position=None, *args, **kwargs)
Insert a parser of the given type into the list of parsers for that type.
- Parameters
element_key (str) – The key of the element in self.parsers (e.g. ‘p’)
parser (html2ans.parsers.base.ElementParser) – The parser object to insert
position (int) – Where to insert the parser in the list
- add_parser(parser, *args, **kwargs)
Add a parser to self.parsers using the parser’s applicable_elements.
- Parameters
parser (html2ans.parsers.base.BaseElementParser) – The parser object to insert
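The relationship between add_parser, insert_parser, and the parsers mapping can be modeled roughly as below. This is an illustrative sketch rather than the library's code: ParserRegistry and NamedParser are hypothetical, and appending when position is None is an assumption, since the documentation above does not say what the default position means.

```python
class NamedParser:
    """Illustrative stand-in for an element parser."""

    def __init__(self, name, applicable_elements):
        self.name = name
        self.applicable_elements = applicable_elements


class ParserRegistry:
    """Toy model of the element-key to parser-list mapping."""

    def __init__(self, default_parsers=None):
        self.parsers = {}
        # Defaults are registered in the order given, so order matters.
        for parser in default_parsers or []:
            self.add_parser(parser)

    def insert_parser(self, element_key, parser, position=None):
        parser_list = self.parsers.setdefault(element_key, [])
        if position is None:
            # Assumption: with no position, the parser goes last.
            parser_list.append(parser)
        else:
            parser_list.insert(position, parser)

    def add_parser(self, parser):
        # Register the parser under every element it declares.
        for element_key in parser.applicable_elements:
            self.insert_parser(element_key, parser)


header = NamedParser("header", ["h1", "h2"])
paragraph = NamedParser("paragraph", ["p"])
custom = NamedParser("custom", ["p"])

registry = ParserRegistry([header, paragraph])
# Put the custom parser ahead of the default paragraph parser.
registry.insert_parser("p", custom, position=0)
```

Because parsers are tried in list order, inserting at position 0 gives the custom parser the first chance at every p element.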
Default HTML Parser
- class html2ans.default.DefaultHtmlAnsParser(*args, **kwargs)
Bases: html2ans.base.BaseHtmlAnsParser
The default root/top-level parser class.
- DEFAULT_PARSERS = [<html2ans.parsers.embeds.ArcPlayerEmbedParser object>, <html2ans.parsers.embeds.DailyMotionEmbedParser object>, <html2ans.parsers.embeds.FacebookPostEmbedParser object>, <html2ans.parsers.embeds.FacebookVideoEmbedParser object>, <html2ans.parsers.embeds.FlickrEmbedParser object>, <html2ans.parsers.embeds.ImgurEmbedParser object>, <html2ans.parsers.embeds.InstagramEmbedParser object>, <html2ans.parsers.embeds.PollDaddyEmbedParser object>, <html2ans.parsers.embeds.RedditEmbedParser object>, <html2ans.parsers.embeds.SpotifyEmbedParser object>, <html2ans.parsers.embeds.TumblrEmbedParser object>, <html2ans.parsers.embeds.TwitterTweetEmbedParser object>, <html2ans.parsers.embeds.TwitterVideoEmbedParser object>, <html2ans.parsers.embeds.YoutubeEmbedParser object>, <html2ans.parsers.embeds.VimeoEmbedParser object>, <html2ans.parsers.embeds.VineEmbedParser object>, <html2ans.parsers.text.HeaderParser object>, <html2ans.parsers.text.ListParser object>, <html2ans.parsers.text.FormattedTextParser object>, <html2ans.parsers.text.BlockquoteParser object>, <html2ans.parsers.text.ParagraphParser object>, <html2ans.parsers.image.LinkedImageParser object>, <html2ans.parsers.image.ImageParser object>, <html2ans.parsers.image.FigureParser object>, <html2ans.parsers.text.InterstitialLinkParser object>, <html2ans.parsers.audio.AudioParser object>, <html2ans.parsers.embeds.IFrameParser object>, <html2ans.parsers.base.NullParser object>]
Default parsers for the default implementation. These will be added to the BaseHtmlAnsParser parsers attribute in the order listed, so order matters!
- BACKUP_PARSERS = [<html2ans.parsers.embeds.IFrameParser object>, <html2ans.parsers.raw_html.RawHtmlParser object>]
Backup parsers for the default implementation. These will be tried in the order listed, so order matters!
-