Python XML parsing
2009-07-23
I've always had this nagging feeling any time I write XML parsing code in Python that it just ought to be simpler.
Conceptually, I tend to always think that an XML document is really just a big nested python dictionary, and I should be able to walk it as such, or even better, as a composition of objects. That's probably a gross oversimplification and probably not correct according to spec or some corner use cases, but really, that's just how my brain works. Show me an XML document and I'll show you a big dictionary tree.
So I've always been disappointed with the experience of having to write actual XML parsing code. In Flumotion I think we've gone through three different XML parsers as well, depending on which developer was supposed to rewrite or add some piece of code that had to deal with XML config files.
So each time I have to write some XML parsing code again, I tend to go and see if there's something better out there. This time I stumbled across a blog post that echoed my sentiments exactly. And it came with a solution: xml.etree.ElementTree. Finally! A library that more or less maps to my mental model. Reading through the tutorial and then writing the code I needed to parse my file took a lot less time than hand-rolling yet another set of functions to parse tags would have.
I'm posting this to increase the google juice of xml.etree.ElementTree. It simply doesn't show up when you google for python xml, and it should be hit number 1!
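To give an idea of how close this gets to the "big dictionary tree" picture, here is a minimal sketch. The config document, its tag names and the to_dict helper are all made up for illustration; this is not Flumotion's actual format or code.

```python
from xml.etree import ElementTree as ET

# Hypothetical config document, invented purely for this example.
XML = """<planet name="demo">
  <component name="producer" type="videotest">
    <property name="width">320</property>
    <property name="height">240</property>
  </component>
  <component name="encoder" type="theora">
    <property name="bitrate">400</property>
  </component>
</planet>"""


def to_dict(component):
    """Collect the <property> children of one element into a plain dict."""
    return dict((p.get("name"), p.text) for p in component.findall("property"))


root = ET.fromstring(XML)

# Attributes are read with .get(), direct children are found with .findall().
components = dict((c.get("name"), to_dict(c)) for c in root.findall("component"))
```

All values come back as strings, so any conversion to numbers is still up to the caller.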
ElementTree is very useful! I have used it since Python 2.4.
NB:
In Python 2.4 (standalone package):
from elementtree import ElementTree as ET
In Python 2.5 and later (standard library):
from xml.etree import ElementTree as ET
Comment by jmanteau — 2009-07-23 @ 11:23
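The two imports in the comment above are often combined into one fallback, so the same script runs whether ElementTree comes from the standard library or from the standalone package. A small sketch, assuming nothing beyond those two import paths (the sample element is made up):

```python
try:
    # Standard library location (Python 2.5 and later).
    from xml.etree import ElementTree as ET
except ImportError:
    # Standalone package, for older interpreters without xml.etree.
    from elementtree import ElementTree as ET

# Either way the rest of the code only ever talks to the ET name.
root = ET.fromstring("<greeting lang='en'>hello</greeting>")
```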
I came across ElementTree a couple of years ago, and I felt exactly the same (it’s fantastic). XML stores data in trees, so why do most XML libraries focus on parsing rather than just doing all the background work and letting you access a tree of data?!
Comment by Nick — 2009-07-23 @ 11:27
ElementTree stores the whole tree in RAM, right? So it is not a feasible solution for really large XML files, right?
thanks,
Paula
Comment by Paula — 2009-07-23 @ 12:37
Didn’t like http://www.crummy.com/software/BeautifulSoup/documentation.html ?
Comment by Michael DeHaan — 2009-07-23 @ 12:38
It’s part of the standard library, ALWAYS have a look there first if you need a solution for anything.
Comment by Markus — 2009-07-23 @ 12:39
http://www.rexx.com/~dkuhlman/pyxmlfaq.html
talks about ElementTree and comes up high in the google search query for python xml
Comment by Zaheer Merali — 2009-07-23 @ 12:41
Hello,
the ElementTree API is really nice. But you may want to have a look at the brilliant lxml module! There you get much more besides ElementTree (for example the “Classify” method for easily mapping XML into classes).
Comment by Christian — 2009-07-23 @ 13:07
You didn’t look too hard previously: ElementTree is a pretty famous XML lib for Python (and it’s compatible with every Python from 1.6). Though recently effbot hasn’t had too much time for it, so it’s languishing (there are major issues in namespace handling in 1.2, especially when you mix default and qualified namespaces; some of them are fixed in the 1.3 betas but not everything).
You might also want to check lxml: it’s pretty good, and it has an ElementTree-like interface (so you get an even faster ET) along with CSS and XPath selectors (the state of node selection in ElementTree is… not always awesome), and it has an HTML parser so you can use it for screen scraping as well.
@Michael DeHaan
> Didn’t like http://www.crummy.com/software/BeautifulSoup/documentation.html ?
BeautifulSoup is not an XML parser or library. Plus it’s slow: it was OK for screen scraping until 3.1, but for heavy XML work it just isn’t up to snuff, even with BeautifulStoneSoup. Plus ET’s API is better.
Comment by masklinn — 2009-07-23 @ 13:51
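For readers who have not run into namespaces in ElementTree, here is a rough sketch of how a document mixing a default and a prefixed namespace looks through the standard library API. The URIs and tag names are invented, and the sketch does not try to reproduce the 1.2 bugs mentioned above; it only shows the basic model.

```python
from xml.etree import ElementTree as ET

# Hypothetical document mixing a default namespace and a prefixed one.
doc = ET.fromstring(
    '<root xmlns="http://example.com/default" xmlns:x="http://example.com/extra">'
    "<item>one</item>"
    "<x:item>two</x:item>"
    "</root>"
)

# Prefixes used in search paths are local to this mapping; they do not
# have to match the prefixes used in the document itself.
ns = {"d": "http://example.com/default", "x": "http://example.com/extra"}

default_items = [e.text for e in doc.findall("d:item", ns)]
extra_items = [e.text for e in doc.findall("x:item", ns)]

# Without a namespace the search finds nothing, because every tag is
# stored in "{uri}localname" form.
bare_items = doc.findall("item")
```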
And it’s even faster if you use xml.etree.cElementTree. BTW, your website is broken, the text is not displayed.
Comment by Julian Andres Klode — 2009-07-23 @ 13:59
Julian,
not sure what you mean ? Which page is broken ?
Comment by Thomas — 2009-07-23 @ 14:14
If you like xml.etree.ElementTree, you should try lxml. http://codespeak.net/lxml/
Comment by Marcin — 2009-07-23 @ 14:35
> not sure what you mean ? Which page is broken ?
When reaching your site from Planet Python’s feed read through google reader, one lands on http://thomas.apestaart.org/log/?p=962 but neither the text nor the comment are displayed (screenie: http://hfr-rehost.net/http://self/pic/159d63f5b73033dae3b7437cc116618a5fa6c5f6.png). I have to click on the title to get the text, and I guess so does jak.
Comment by masklinn — 2009-07-23 @ 14:43
@masklinn: well, I double-checked. The page’s HTML validates perfectly. It looks like you’re using Safari. Any chance you can try it with a different browser to narrow the problem down ?
Comment by Thomas — 2009-07-23 @ 16:46
I hate ElementTree with fiery hot passion of a thousand suns. It’s fine for a very limited subset of XML (“data-oriented” XML, or in other words, formats that maybe shouldn’t have been in XML in the first place), but worthless for “document” XML, that is XML with mixed content. ElementTree is willfully ignorant of mixed content (don’t tell me about .tail, it’s a ridiculous, stupid hack).
That would be OK if everyone understood that ET is only about an XML subset. But every time I complain about Python’s lousy XML handling in the std lib, and pine for LibXML, someone starts talking about ET, as if it was a general solution for working with XML. It’s not.
Comment by Matt Chaput — 2009-07-23 @ 15:06
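For anyone who has not met .tail: the sketch below shows how ElementTree stores mixed content. Text that follows a child element is kept on that child rather than on the parent, which is the design being criticized above. The sample paragraph is made up.

```python
from xml.etree import ElementTree as ET

# A made-up paragraph with mixed content: text interleaved with child elements.
para = ET.fromstring("<p>Hello <b>bold</b> and <i>italic</i> world.</p>")

bold = para.find("b")
italic = para.find("i")

# Text before the first child lives in the parent's .text; text after a
# child lives in that child's .tail.
pieces = [para.text, bold.text, bold.tail, italic.text, italic.tail]

# itertext() walks .text and .tail in document order.
flattened = "".join(para.itertext())
```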
cElementTree is also one of the most memory efficient tree-based XML interfaces for Python. For some large processing jobs I’ve done, it was the only one that could finish the job without swapping. So it saved me from having to rewrite for an event based processor.
Paula: if you use ElementTree’s iterparse() mode, you can parse large documents by removing nodes from the tree as it finishes parsing them. This pretty much lets you parse the document in arbitrary sized chunks.
Comment by James Henstridge — 2009-07-23 @ 15:43
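The iterparse() approach described above can be sketched roughly as follows. The record layout and the sum_record_ids helper are hypothetical; the idea is simply to handle each element as soon as it is complete and then clear the root so the tree never grows.

```python
import io
from xml.etree import ElementTree as ET


def sum_record_ids(source):
    """Stream over <record> elements without keeping them in memory."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    # The first event is the start of the root element; keep a handle on it.
    _, root = next(context)
    count = 0
    total = 0
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            count += 1
            total += int(elem.get("id"))
            # Drop everything parsed so far. Note that this also clears
            # the root's own attributes and text.
            root.clear()
    return count, total, len(root)


# Hypothetical input: 1000 tiny records, standing in for a large file.
xml_bytes = (
    "<log>" + "".join("<record id='%d'/>" % i for i in range(1000)) + "</log>"
).encode("utf-8")

count, total, remaining = sum_record_ids(io.BytesIO(xml_bytes))
```

With a real file, the BytesIO object would be replaced by a filename or an open file object.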
+1 lxml.
It provides the API of many popular Python XML libraries, but is based on libxml2 for correctness and speed.
Comment by Davyd — 2009-07-23 @ 16:10
I’m also a fan of BeautifulSoup
Comment by rgz — 2009-07-23 @ 16:18
@masklinn: The problem is your browser. For some reason, the page renders as you see with Safari (and presumably other WebKit browsers), but displays fine under Firefox.
I also find this very annoying, but lack the time/inclination to dig into why the page renders differently between the two browsers.
Comment by Jonathan Pryor — 2009-07-23 @ 16:33
Take a look at this -> http://git.gnome.org/cgit/libgeexml
Comment by Roberto Majadas — 2009-07-23 @ 16:42
+1 lxml
Check it out.
Comment by Keegan — 2009-07-23 @ 16:45
> @masklinn: The problem is your browser. For some reason, the page renders as you see with Safari (and presumably other WebKit browsers), but displays fine under Firefox.
Wrong. If only because my browser is Camino (2.0b4pre) not Safari (Safari doesn’t have a status bar at the bottom of the screen). That doesn’t matter anyway: the behavior is exactly the same with Safari 4.0.2, Webkit r46237 (the latest revision as of this post), Opera 10.0 beta 2 and Firefox 3.5.1
> @masklinn: well, I double-checked. The page’s HTML validates perfectly.
The issue is not with the markup so much as the lack of markup: I checked the DOM in the web inspector (in safari but I expect it’s the same in Firefox) and the content is simply *not there*. Not invalid or incorrect, non-existent.
> It looks like you’re using Safari. Any chance you can try it with a different browser to narrow the problem down ?
I wasn’t using Safari (it was Camino), but as noted above the behavior is the same whether it’s Safari, Camino, Firefox or Opera (I could do a screencast to prove it but it wouldn’t exactly be useful). And the issue only seems to crop up when coming from Google Reader. I’d expect there’s an issue somewhere in the content generation (WordPress? A plugin?) where it gets partially disabled for some referers (if I copy the URL from Google Reader into the URL bar and validate, which prevents referer transmission, the content appears; if I click on the Google Reader link the content doesn’t).
I just checked with Firefox’s Live HTTP Headers extension: the only differences are the “Referer: http://www.google.com/reader/view/” header in one case and not the other, and the values for the Date and Expires headers.
Comment by masklinn — 2009-07-23 @ 17:30
@Thomas: I also think of XML documents as a big dictionary, which is the DOM approach to XML parsing (amirite?). However, while writing my own project that uses XML, I chose to use a SAX parser because I was concerned about memory usage. Thanks to this blog entry I’ll reconsider that (especially because the XML documents I’m dealing with are usually not that large).
@Michael, rgz: BeautifulSoup was a great library, but the author no longer enjoys maintaining it.
Comment by Kurt McKee — 2009-07-23 @ 17:33
lxml
It’s the only Python XML parsing library you need.
I promise. :-)
Comment by Michael Schurter — 2009-07-23 @ 18:38
I strongly recommend PyQuery: http://pypi.python.org/pypi/pyquery
It’s heavily inspired by the jQuery JavaScript framework and shares the same power and simplicity.
Under the covers it is based on lxml, which in turn is based on libxml2, so you also get excellent performance.
Comment by Emanuele Aina — 2009-07-23 @ 21:34
I’m always surprised at how few people know about Amara which has such a lovely, straightforward and powerful API.
http://xml3k.org/Amara2/Seven_days
http://wiki.xml3k.org/Amara2/Whatsnew
That’s the one I always use personally, much much friendlier than ET and lxml in my book. But then again, not everyone’s taste perhaps.
Comment by Sylvain Hellegouarch — 2009-07-23 @ 21:39
I used to use ElementTree, but recently have been converted to Amara. It’s very impressive, and so far the only Python library I’ve found that supports a significant amount of XPath.
Comment by John Millikin — 2009-07-23 @ 22:25
Your blog software ate the link: http://xml3k.org/Amara2
Comment by John Millikin — 2009-07-23 @ 22:25
@Matt: If you could tell us 1000 reasons why lxml sucks, then please do! Your comment here is something I would almost call trolling! What is “document” XML? When one writes lots of text between tags? And even if that is it, what’s the problem with lxml handling that, and what are the better alternatives in your opinion?
Comment by Christian — 2009-07-24 @ 16:30
@Christian: umm, hello? I was criticizing the ElementTree API, not lxml. Document XML is XML marking up a document, ie lots of mixed content, rather than XML that’s just organizing data, ie no mixed content. The ET API is worthless for working with mixed content. If you don’t even know the subject, you might want to avoid commenting on it.
Comment by Matt Chaput — 2009-07-24 @ 21:20
ElementTree may well be useless for working with mixed XML, but it’s very well-designed for dealing with your typical XML-based data storage format. In fact, the disregard for mixed content, if anything, makes it better for that.
Comment by makomk — 2009-07-29 @ 18:39
Matt is right. The ET API *is* worthless for working with mixed content.
Comment by David — 2009-07-30 @ 06:27
The other option that may be worth looking into is vtd-xml
http://vtd-xml.sf.net
http://www.developer.com/java/other/article.php/3714051
Comment by Jimmy Zhang — 2009-08-17 @ 21:45
etree has been giving me all kinds of grief dealing with mixed content XML. I hate to have to be the one to break this to people, but XML is absolutely AMAZINGLY well suited for mixed content. Having something that implements DOM as it was intended would be ideal. Even .NET handles mixed content XML better than etree!
Comment by Kurt — 2010-04-02 @ 21:51
Folks look at XML through the lens of applications they have worked on.
XML originated with the document processing folks (who developed SGML, which begat XML). They developed the concept of a markup vocabulary to ‘structure’ text documents so they could be processed through a persons-in-the-loop workflow chain of human process (editorial and clerical staff reading, interpreting and adding value via markup) and automation (extraction, compilation, augmentation, composition …).
The XML originators were quickly joined by the IT folks who looked upon it as a universal serialization standard for hierarchical data — a different lens, with a much larger technical user population.
A key aspect of much document processing is that the document design was driven by human usage considerations, with little, if any thought given to automated processing; thus you have ‘mixed content’ when structure is imposed on the document.
In the classical IT environment, where documents are designed from the beginning as an integral part of an automated system, ‘mixed content’ is most often avoided.
So, your application design environment drives your choice of tools. I appreciate the comments regarding what folks have learned in applying the Element Tree paradigm to the formal, automated design ‘document’, and when applying it to ‘document not driven by automation consideration (aka mixed content)’. Do both types of application a few times, and you will come to recognize the different roles xml markup might play in your systems architecture.
Comment by techArtist — 2010-06-20 @ 18:25