XML Walker section

An XML walker source section yields a hierarchy of items by iterating over an `lxml.etree`_ tree of XML elements that match an `XPath`_. This can be used to build content structure based on the sitemap or navigation of a HTML web site.

Options starting with element- may contain expressions whose value will be inserted into the element items. The expressions have access to the following:

element

the current walked element

item

the current walked element item to be yielded

source_item

the original item containing the walked tree

tree

the original walked tree

transmogrifier

the transmogrifier

name

the name of the inserter section

options

the inserter options

modules

sys.modules

Start with an HTML file containing a heirarchical navbar.

>>> import os
>>> html_file = os.path.join(
...     os.path.dirname(__file__), 'xmlwalker.html')
>>> infologger = """
... [transmogrifier]
... pipeline =
...     source
...     parse
...     walk
...     defaultpage
...     clean
...     logger
...
... [source]
... blueprint = collective.transmogrifier.sections.tests.rangesource
... size = 1
...
... [parse]
... blueprint = collective.transmogrifier.sections.inserter
... key = string:_trees
... value = python:modules['lxml.html'].parse('{}').xpath(\
...     "//*[contains(@class, 'navbar')]//ul[contains(@class, 'nav')]")
...
... [walk]
... blueprint = collective.transmogrifier.sections.xmlwalker
... element-keys =
...     _path
...     title
... element-_path = python:element.attrib.get(\
...     'href', element.attrib.get('src', ''))
... element-title = python:element.text_content().strip()\
...     or element.attrib.get('alt', '')
...
... [defaultpage]
... blueprint = collective.transmogrifier.sections.inserter
... key = string:_defaultpage
... condition = python:item.get('_parent', dict()).pop('_parent', True)\
...     and item.get('_defaultpage')
... value = exists:item/_defaultpage
...
... [clean]
... blueprint = collective.transmogrifier.sections.manipulator
... delete =
...     _trees
...     _element
...     id
...
... [logger]
... blueprint = collective.transmogrifier.sections.logger
... name = logger
... level = INFO
... """.format(html_file)
>>> registerConfig('collective.transmogrifier.sections.tests.xmlwalker',
...                infologger)
>>> transmogrifier('collective.transmogrifier.sections.tests.xmlwalker')
>>> print(handler)
logger INFO
  {}
logger INFO
  {'_parent': {}, '_path': '#', '_type': 'Folder', 'title': 'Foo Tab'}
logger INFO
    {'_is_defaultpage': True,
   '_parent': {'_path': '#', '_type': 'Folder', 'title': 'Foo Tab'},
   '_path': '#',
   'title': 'Foo Tab'}
logger INFO
    {'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Foo Tab'},
   '_path': '../foo-tab/index.html',
   'title': 'Foo Tab Default Page'}
logger INFO
    {'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Foo Tab'},
   '_path': '../foo-tab/bar-image.png',
   'title': 'Bar Image'}
logger INFO
    {'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Foo Tab'},
   '_path': '../foo-tab/qux-page.html',
   'title': 'Qux Page'}
logger INFO
  {'_parent': {}, '_path': '#', '_type': 'Folder', 'title': 'Company'}
logger INFO
    {'_is_defaultpage': True,
   '_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
   '_path': '#',
   'title': 'Company'}
logger INFO
    {'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
   '_path': '../company/news.html',
   '_type': 'Folder',
   'title': 'News'}
logger INFO
    {'_is_defaultpage': True,
   '_parent': {'_path': '../company/news.html',
               '_type': 'Folder',
               'title': 'News'},
   '_path': '../company/news.html',
   'title': 'News'}
logger INFO
    {'_parent': {'_path': '../company/news.html',
               '_type': 'Folder',
               'title': 'News'},
   '_path': '../company/news.html',
   'title': 'News'}
logger INFO
    {'_parent': {'_path': '../company/news.html',
               '_type': 'Folder',
               'title': 'News'},
   '_path': '../company/press_releases.html',
   'title': 'Press Releases'}
logger INFO
    {'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
   '_path': '../company/events.html',
   'title': 'Events'}
logger INFO
    {'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
   '_path': '../contact_us/contact.html',
   'title': 'Contact Us'}
logger INFO
    {'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
   '_path': '../company/index.html',
   '_type': 'Folder',
   'title': 'About Company'}
logger INFO
    {'_is_defaultpage': True,
   '_parent': {'_path': '../company/index.html',
               '_type': 'Folder',
               'title': 'About Company'},
   '_path': '../company/index.html',
   'title': 'About Company'}
logger INFO
    {'_parent': {'_path': '../company/index.html',
               '_type': 'Folder',
               'title': 'About Company'},
   '_path': '../company/management.html',
   'title': 'Management'}
logger INFO
    {'_parent': {'_path': '../company/index.html',
               '_type': 'Folder',
               'title': 'About Company'},
   '_path': '../company/investors.html',
   'title': 'Investors'}
logger INFO
    {'_parent': {'_path': '../company/index.html',
               '_type': 'Folder',
               'title': 'About Company'},
   '_path': '../company/careers.html',
   'title': 'Careers'}
logger INFO
    {'_parent': {'_path': '../company/index.html',
               '_type': 'Folder',
               'title': 'About Company'},
   '_path': '../company/company.html',
   'title': 'About Us'}