Document Parsing
Interface
Base HTML Parser
class html2ans.base.BaseHtmlAnsParser(ans_version=None, soup_parse_lib='lxml', suppress_exceptions=False, default_parsers=None, *args, **kwargs)

Bases: html2ans.base.AbstractHtmlAnsParser, html2ans.parsers.utils.AbstractParserUtilities

The base root/top-level parser class; assumes elements will be parsed with BeautifulSoup. Use this class to generate a list of ANS elements from an HTML document. HTML elements within the document are parsed using the element parsers present in the parsers attribute. The parsers attribute is populated first with the parsers from the DEFAULT_PARSERS variable. Other parsers can be added using this class’s add_parser and insert_parser methods.

Attempts at parsing an element use each parser in that element’s parser list, in order. If a parser isn’t applicable (is_applicable returns False) or parse indicates the element wasn’t a match, the next parser is tried. Why are there two ways of indicating whether an element and a parser are compatible? You can’t always tell until you’re parsing the element. is_applicable catches early, obvious issues (like “this parser is for img elements; is this an img?”).

Setting ans_version won’t affect the output of these parsers in terms of actual ANS version compatibility; it is provided as a convenience (sometimes you need to update the overall ANS version because of a specific new feature, but don’t want to update the output of all parsers).

Use the suppress_exceptions option to treat element parsing exceptions as non-matches. With suppress_exceptions, when an exception is thrown, the next parser is tried (as though is_applicable returned False).

- Parameters
ans_version (str) – the ANS version to apply to the output of parsers that require a version
soup_parse_lib (str) – the BeautifulSoup parsing library to use
suppress_exceptions (bool) – whether or not to suppress exceptions during element parsing
default_parsers (list) – The default parsers to populate parsers with. Order matters here!
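
The fall-through behaviour described above (skip parsers that are not applicable, move on when parse reports no match, and optionally treat exceptions as non-matches) can be sketched roughly as follows. This is an illustrative model only, not the html2ans implementation: ToyParser, try_parsers, the dictionary-shaped element, and the (output, matched) return shape are all assumptions made for the example.

```python
class ToyParser:
    """Hypothetical stand-in for an element parser; not an html2ans class."""

    def __init__(self, name, tags, result=None, error=None):
        self.name = name
        self.applicable_elements = tags
        self._result = result
        self._error = error

    def is_applicable(self, element):
        # Cheap, early check: does the tag name match at all?
        return element["tag"] in self.applicable_elements

    def parse(self, element):
        # Assumed return shape: (output, matched).
        if self._error is not None:
            raise self._error
        if self._result is None:
            return None, False
        return {"type": self._result, "content": element["text"]}, True


def try_parsers(element, parsers, suppress_exceptions=False):
    """Return the first successful parse output, or None if nothing matched."""
    for parser in parsers:
        if not parser.is_applicable(element):
            continue  # obvious mismatch, caught before parsing
        try:
            output, matched = parser.parse(element)
        except Exception:
            if suppress_exceptions:
                continue  # treat the exception as a non-match
            raise
        if matched:
            return output
    return None


element = {"tag": "p", "text": "Hello"}
parsers = [
    ToyParser("image", ["img"], result="image"),         # not applicable
    ToyParser("broken", ["p"], error=ValueError("bad")),  # raises in parse
    ToyParser("picky", ["p"]),                            # applicable, no match
    ToyParser("text", ["p"], result="text"),              # matches
]
result = try_parsers(element, parsers, suppress_exceptions=True)
```

With suppress_exceptions left at False, the same call would propagate the ValueError raised by the second toy parser instead of moving on to the next one.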
BACKUP_PARSERS = []

The backup parsers for this class to use. This list will be used on _every_ parsing attempt if the primary parsers for a given type don’t match. For example, if parsers contains 'p': [ParagraphParser()] and an exception is thrown when processing a paragraph tag using the ParagraphParser, all backup parsers will also be tried. In the default implementation, this essentially means that producing raw_html will be the last resort when all other parsers fail.
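
As a rough illustration of that fallback order, the sketch below tries the primary parsers registered for a tag and then the backup list. It is a simplified model under stated assumptions, not library code: parsers are hypothetical plain functions that return a dictionary on a match or None otherwise, and exceptions are always treated as non-matches (as if suppress_exceptions were enabled).

```python
def parse_with_backups(element, parsers, backup_parsers):
    """Try the primary parsers for the element's tag, then the backups."""
    primary = parsers.get(element["tag"], [])
    for parser in list(primary) + list(backup_parsers):
        try:
            output = parser(element)
        except Exception:
            continue  # assumption: exceptions count as non-matches
        if output is not None:
            return output
    return None


def paragraph_parser(element):
    # Hypothetical primary parser that rejects empty paragraphs.
    if not element["text"].strip():
        raise ValueError("empty paragraph")
    return {"type": "text", "content": element["text"]}


def raw_html_backup(element):
    # Hypothetical last resort: keep the markup as-is.
    markup = "<{0}>{1}</{0}>".format(element["tag"], element["text"])
    return {"type": "raw_html", "content": markup}


parsers = {"p": [paragraph_parser]}
backups = [raw_html_backup]

ok = parse_with_backups({"tag": "p", "text": "Hi"}, parsers, backups)
fallback = parse_with_backups({"tag": "p", "text": " "}, parsers, backups)
unknown = parse_with_backups({"tag": "marquee", "text": "x"}, parsers, backups)
```

Here the first call is handled by the primary parser, while the second (which raises) and the third (which has no primary parsers) both end up as raw HTML.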
parsers = None

A mapping of potential HTML elements to a list of parsers to use to attempt to parse elements of that type
generate_ans(html, start_tag='body', *args, **kwargs)

Parses html and produces ANS in a jsonify-able format.
- Parameters
html (str) – the html to parse
start_tag (str) – where to start parsing (if not provided, all tags will be parsed)
- Returns
a list of ANS elements as dictionaries
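
A minimal usage sketch, assuming the html2ans package is installed and that DefaultHtmlAnsParser (documented further down this page) can be constructed with no arguments. The helper name html_to_ans_json is hypothetical; only generate_ans and its start_tag parameter come from the signature above.

```python
import json


def html_to_ans_json(html, start_tag="body"):
    """Convert an HTML string to a JSON string of ANS elements."""
    # Imported lazily so this module still loads where html2ans is absent.
    from html2ans.default import DefaultHtmlAnsParser

    parser = DefaultHtmlAnsParser()
    # generate_ans returns a list of dictionaries, so it can be serialized.
    ans_elements = parser.generate_ans(html, start_tag=start_tag)
    return json.dumps(ans_elements, indent=2)
```

Calling html_to_ans_json("<body><p>Hello</p></body>") would then return the serialized list; the exact fields in each element depend on the parsers involved.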
insert_parser(element_key, parser, position=None, *args, **kwargs)

Insert a parser of the given type into the list of parsers for that type.
- Parameters
element_key (str) – The key of the element in self.parsers (e.g. ‘p’)
parser (html2ans.parsers.base.ElementParser) – The parser object to insert
position (int) – Where to insert the parser in the list
add_parser(parser, *args, **kwargs)

Add a parser to self.parsers using the parser’s applicable_elements.
- Parameters
parser (html2ans.parsers.base.BaseElementParser) – The parser object to insert
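
The relationship between the two methods can be pictured with a small stand-in registry. This is a sketch, not the library's code: ToyRegistry and the toy parser classes are hypothetical, and appending when position is None is an assumption, since the behaviour for that default is not stated here.

```python
class ToyRegistry:
    """Hypothetical model of the parsers mapping; not an html2ans class."""

    def __init__(self):
        self.parsers = {}  # element key -> ordered list of parsers

    def insert_parser(self, element_key, parser, position=None):
        bucket = self.parsers.setdefault(element_key, [])
        if position is None:
            bucket.append(parser)  # assumption: None means "at the end"
        else:
            bucket.insert(position, parser)

    def add_parser(self, parser):
        # Register the parser under every element it declares.
        for element_key in parser.applicable_elements:
            self.insert_parser(element_key, parser)


class ToyHeaderParser:
    applicable_elements = ["h1", "h2"]


class ToyTitleParser:
    applicable_elements = ["h1"]


registry = ToyRegistry()
registry.add_parser(ToyHeaderParser())
# Put the more specific parser first so it is tried before the generic one.
registry.insert_parser("h1", ToyTitleParser(), position=0)

order_h1 = [type(p).__name__ for p in registry.parsers["h1"]]
order_h2 = [type(p).__name__ for p in registry.parsers["h2"]]
```

Because parsers are tried in list order, inserting at position 0 is the way to give a custom parser priority over those already registered for the same element.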
Default HTML Parser
class html2ans.default.DefaultHtmlAnsParser(*args, **kwargs)

Bases: html2ans.base.BaseHtmlAnsParser

The default root/top-level parser class.
DEFAULT_PARSERS = [
    <html2ans.parsers.embeds.ArcPlayerEmbedParser object>,
    <html2ans.parsers.embeds.DailyMotionEmbedParser object>,
    <html2ans.parsers.embeds.FacebookPostEmbedParser object>,
    <html2ans.parsers.embeds.FacebookVideoEmbedParser object>,
    <html2ans.parsers.embeds.FlickrEmbedParser object>,
    <html2ans.parsers.embeds.ImgurEmbedParser object>,
    <html2ans.parsers.embeds.InstagramEmbedParser object>,
    <html2ans.parsers.embeds.PollDaddyEmbedParser object>,
    <html2ans.parsers.embeds.RedditEmbedParser object>,
    <html2ans.parsers.embeds.SpotifyEmbedParser object>,
    <html2ans.parsers.embeds.TumblrEmbedParser object>,
    <html2ans.parsers.embeds.TwitterTweetEmbedParser object>,
    <html2ans.parsers.embeds.TwitterVideoEmbedParser object>,
    <html2ans.parsers.embeds.YoutubeEmbedParser object>,
    <html2ans.parsers.embeds.VimeoEmbedParser object>,
    <html2ans.parsers.embeds.VineEmbedParser object>,
    <html2ans.parsers.text.HeaderParser object>,
    <html2ans.parsers.text.ListParser object>,
    <html2ans.parsers.text.FormattedTextParser object>,
    <html2ans.parsers.text.BlockquoteParser object>,
    <html2ans.parsers.text.ParagraphParser object>,
    <html2ans.parsers.image.LinkedImageParser object>,
    <html2ans.parsers.image.ImageParser object>,
    <html2ans.parsers.image.FigureParser object>,
    <html2ans.parsers.text.InterstitialLinkParser object>,
    <html2ans.parsers.audio.AudioParser object>,
    <html2ans.parsers.embeds.IFrameParser object>,
    <html2ans.parsers.base.NullParser object>
]

Default parsers for the default implementation. These will be added to the BaseHtmlAnsParser parsers attribute in the order listed, so order matters!
BACKUP_PARSERS = [<html2ans.parsers.embeds.IFrameParser object>, <html2ans.parsers.raw_html.RawHtmlParser object>]

Backup parsers for the default implementation. These will be tried in the order listed, so order matters!