
calibre User Manual, Release 3.22.1
2.2. Customizing the fetch process

        Content).read()
        return BeautifulSoup(raw.decode('cp1252', 'replace'))

We see several new features in this recipe. First, we have:

    timefmt = ' '

This sets the displayed time on the front page of the created e-book to be in the format Day, Day_Number Month, Year.

Then we see a group of directives to clean up the downloaded HTML:

    remove_tags_before = dict(name='h1')
    remove_tags_after = dict(id='footer')
    remove_tags = ...

These remove everything before the first h1 tag and everything after the first tag whose id is footer. See remove_tags (page 54), remove_tags_before (page 54), remove_tags_after (page 54).

The next interesting feature is:

    needs_subscription = True

needs_subscription = True tells calibre that this recipe needs a username and password in order to access the content. This causes calibre to ask for a username and password whenever you try to use this recipe. The code in BasicNewsRecipe.get_browser() (page 48) actually does the login into the NYT website. Once logged in, calibre will use the same logged-in browser instance to fetch all content. See mechanize to understand the code in get_browser.

The next new feature is the BasicNewsRecipe.parse_index() (page 49) method. Its job is to fetch the list of articles that appear in today's paper. While more complex than simply using RSS, the recipe creates an e-book that corresponds very closely to the day's paper. parse_index makes heavy use of BeautifulSoup to parse the daily paper webpage. You can also use other, more modern parsers if you dislike BeautifulSoup. calibre comes with lxml and html5lib, which are the recommended parsers. To use them, replace the call to index_to_soup() with the following:

    raw = self.index_to_soup(url, raw=True)
    # For html5lib
    import html5lib
    root = html5lib.parse(raw, namespaceHTMLElements=False, treebuilder='lxml')
    # For the lxml html 4 parser
    from lxml import html
    root = html.fromstring(raw)

The final new feature is the BasicNewsRecipe.preprocess_html() (page 50) method.
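To get a feel for the timefmt setting discussed in this section, the following sketch renders a date in the Day, Day_Number Month, Year layout. It assumes that timefmt is interpreted with strftime-style codes, and the format string shown is an assumption chosen to match that layout, not a value taken from the recipe. The helper front_page_date is hypothetical and exists only for illustration.

```python
from datetime import date

# Assumed format string, chosen to match "Day, Day_Number Month, Year".
# It is not copied from the recipe itself.
timefmt = ' [%a, %d %b, %Y]'


def front_page_date(day, fmt=timefmt):
    """Render a date using strftime-style codes (hypothetical helper)."""
    return day.strftime(fmt)


sample = front_page_date(date(2018, 4, 20))
print(repr(sample))
```

The leading space in the format string keeps the date visually separated from whatever text precedes it on the front page.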
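The effect of remove_tags_before and remove_tags_after in this section can be illustrated with a small standalone sketch. This is not how calibre implements these directives. It is a simplified model that uses only the Python standard library, looks only at the direct children of a body element, and assumes well-formed markup. The function trim_body and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def trim_body(body, before_name, after_id):
    """Keep the children from the first <before_name> element through the
    first element whose id equals after_id; remove everything else.

    Simplified model: only direct children of body are considered.
    """
    children = list(body)
    # Index of the first child with the requested tag name (default: keep from start).
    start = next((i for i, el in enumerate(children) if el.tag == before_name), 0)
    # Index of the first child with the requested id (default: keep to the end).
    end = next(
        (i for i, el in enumerate(children) if el.get('id') == after_id),
        len(children) - 1,
    )
    for i, el in enumerate(children):
        if i < start or i > end:
            body.remove(el)
    return body


page = ET.fromstring(
    '<body><div id="nav">menu</div><h1>Headline</h1><p>Story text</p>'
    '<div id="footer">footer</div><div id="ads">ads</div></body>'
)
trim_body(page, 'h1', 'footer')
kept = [(el.tag, el.get('id')) for el in page]
print(kept)
```

In this model the navigation block before the h1 and the advertisement block after the footer are dropped, while the headline, the story text and the footer element itself are kept.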
