Home How do I parse XML in Python?
Reply: 15

How do I parse XML in Python?

randombits
1#
randombits Published in 2009-12-16 05:09:24Z

I have many rows in a database that contains xml and I'm trying to write a Python script that will go through those rows and count how many instances of a particular node attribute show up. For instance, my tree looks like:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

How can I access the attributes 1 and 2 in the XML using Python?

Daniel F
2#
Daniel F Reply to 2017-12-10 20:11:15Z

I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed -- the ease of programming part depends on the API, which ElementTree defines.

After building an Element instance e from the XML, e.g. with the XML function, or by parsing a file with something like

import xml.etree.ElementTree
e = xml.etree.ElementTree.parse('thefile.xml').getroot()

or any of the many other ways shown at ElementTree, you just do something like:

for atype in e.findall('type'):
    print(atype.get('foobar'))

and similar, usually pretty simple, code patterns.

Dimitrios Paraschas
3#
Dimitrios Paraschas Reply to 2015-04-08 16:16:50Z

You can use BeautifulSoup

from bs4 import BeautifulSoup

x="""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

y=BeautifulSoup(x)
>>> y.foo.bar.type["foobar"]
u'1'

>>> y.foo.bar.findAll("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.findAll("type")[0]["foobar"]
u'1'
>>> y.foo.bar.findAll("type")[1]["foobar"]
u'2'
Thijs de Z
4#
Thijs de Z Reply to 2012-11-15 13:12:15Z

I'm still a Python newbie myself, but my impression is that ElementTree is the state-of-the-art in Python XML parsing and handling.

Mark Pilgrim has a good section on Parsing XML with ElementTree in his book Dive Into Python 3.

Tor Valamo
5#
Tor Valamo Reply to 2009-12-16 08:47:08Z

Python has an interface to the expat xml parser.

xml.parsers.expat

It's a non-validating parser, so bad xml will not be caught. But if you know your file is correct, then this is pretty good, and you'll probably get the exact info you want and you can discard the rest on the fly.

stringofxml = """<foo>
    <bar>
        <type arg="value" />
        <type arg="value" />
        <type arg="value" />
    </bar>
    <bar>
        <type arg="value" />
    </bar>
</foo>"""
count = 0
def start(name, attr):
    global count
    if name == 'type':
        count += 1

p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml)

print count # prints 4
EMP
6#
EMP Reply to 2009-12-16 05:28:55Z

I find the Python xml.dom and xml.dom.minidom quite easy. Keep in mind that DOM isn't good for large amounts of XML, but if your input is fairly small then this will work fine.

Martin Thoma
7#
Martin Thoma Reply to 2015-02-23 10:22:41Z

minidom is the quickest and pretty straight forward:

XML:

<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

PYTHON:

from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

OUTPUT

4
item1
item1
item2
item3
item4
sandy
8#
sandy Reply to 2013-06-07 08:15:54Z

lxml.objectify is really simple.

Taking your sample text:

from lxml import objectify
from collections import defaultdict

count = defaultdict(int)

root = objectify.fromstring(text)

for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print dict(count)

Output:

{'1': 1, '2': 1}
Jan Kohila
9#
Jan Kohila Reply to 2013-07-09 20:09:34Z

Here a very simple but effective code using cElementTree.

try:
    import cElementTree as ET
except ImportError:
  try:
    # Python 2.5 need to import a different module
    import xml.etree.cElementTree as ET
  except ImportError:
    exit_err("Failed to import cElementTree from any known place")      

def find_in_tree(tree, node):
    found = tree.find(node)
    if found == None:
        print "No %s in file" % node
        found = []
    return found  

# Parse a xml file (specify the path)
def_file = "xml_file_name.xml"
try:
    dom = ET.parse(open(def_file, "r"))
    root = dom.getroot()
except:
    exit_err("Unable to open and parse input definition file: " + def_file)

# Parse to find the child nodes list of node 'myNode'
fwdefs = find_in_tree(root,"myNode")

Source:

http://www.snip2code.com/Snippet/991/python-xml-parse?fromPage=1

Cyrus
10#
Cyrus Reply to 2016-08-18 23:23:43Z

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website:

library                         time    space
xml.dom.minidom (Python 2.1)    6.3 s   80000K
gnosis.objectify                2.0 s   22000k
xml.dom.minidom (Python 2.4)    1.4 s   53000k
ElementTree 1.2                 1.6 s   14500k  
ElementTree 1.2.4/1.3           1.1 s   14500k  
cDomlette (C extension)         0.540 s 20500k
PyRXPU (C extension)            0.175 s 10850k
libxml2 (C extension)           0.098 s 16000k
readlines (read as utf-8)       0.093 s 8850k
cElementTree (C extension)  --> 0.047 s 4900K <--
readlines (read as ascii)       0.032 s 5050k   

As pointed out by @j-f-sebastian, cElementTree comes bundled with python: "On Python 2: from xml.etree import cElementTree as ElementTree. On Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically)."

myildirim
11#
myildirim Reply to 2014-06-13 07:02:11Z

I suggest xmltodict for simplicity.

It parses your xml to an OrderedDict;

>>> e = '<foo>
             <bar>
                 <type foobar="1"/>
                 <type foobar="2"/>
             </bar>
        </foo> '

>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

>>> result['foo']

OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

>>> result['foo']['bar']

OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])
Radhamohan Parashar
12#
Radhamohan Parashar Reply to 2016-05-20 09:37:50Z

rec.xml:-

<?xml version="1.0"?>
<nodes>
    <node name="Car" child="Engine"></node>
    <node name="Engine" child="Piston"></node>
    <node name="Engine" child="Carb"></node>
    <node name="Car" child="Wheel"></node>
    <node name="Wheel" child="Hubcaps"></node>
    <node name="Truck" child="Engine"></node>
    <node name="Truck" child="Loading Bin"></node>
    <node name="Piston" child="Loa"></node>
    <node name="Spare Wheel" child=""></node>
</nodes>

par.py:-

import xml.etree.ElementTree as ET
tree = ET.parse('rec.xml')
root = tree.getroot()

for nodes in root.findall('node'):
    parent = nodes.attrib.get('name')
    child = nodes.attrib.get('child')
    print parent,child
Souvik Dey
13#
Souvik Dey Reply to 2017-02-20 15:56:58Z
import xml.etree.ElementTree as ET
data = '''<foo>
           <bar>
               <type foobar="1"/>
               <type foobar="2"/>
          </bar>
       </foo>'''
tree = ET.fromstring(data)
lst = tree.findall('bar/type')
for item in lst:
    print item.get('foobar')

This will print the value of foobar attribute.

jessag
14#
jessag Reply to 2017-03-17 09:23:13Z

Just to add another possibility, you can use untangle, as it is a simple xml-to-python-object library. Here you have an example:

Installation

pip install untangle

Usage

Your xml file (a little bit changed):

<foo>
   <bar name="bar_name">
      <type foobar="1"/>
   </bar>
</foo>

accessing the attributes with untangle:

import untangle

obj = untangle.parse('/path_to_xml_file/file.xml')

print obj.foo.bar['name']
print obj.foo.bar.type['foobar']

the output will be:

bar_name
1

More information about untangle can be found here.
Also (if you are curious), you can find a list of tools for working with XML and Python here (you will also see that the most common ones were mentioned by previous answers).

Artem Bernatskyi
15#
Artem Bernatskyi Reply to 2017-08-25 12:24:06Z

Are you serious ?

What about security concerns ? Use defusedxml .

This is also recommended by Two Scoops of Django .

Comparison about defusedxml vs other libraries

Lxml is protected against billion laughs attacks and doesn't do network lookups by default.

libxml2 and lxml are not directly vulnerable to gzip decompression bombs but they don't protect you against them either.

xml.etree doesn't expand entities and raises a ParserError when an entity occurs.

minidom doesn't expand entities and simply returns the unexpanded entity verbatim.

genshi.input of genshi 0.6 doesn't support entity expansion and raises a ParserError when an entity occurs.

Library has (limited) XInclude support but requires an additional step to process inclusion.

gatkin
16#
gatkin Reply to 2017-09-04 17:40:26Z

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used to for both serialization and parsing as well as for a basic level of validation.

Parsing into Python data structures is straightforward:

import declxml as xml

xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.dictionary('bar', [
        xml.array(xml.integer('type', attribute='foobar'))
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the output:

{'bar': {'foobar': [1, 2]}}

You can also use the same processor to serialize data to XML

data = {'bar': {
    'foobar': [7, 3, 21, 16, 11]
}}

xml.serialize_to_string(processor, data, indent='    ')

Which produces the following output

<?xml version="1.0" ?>
<foo>
    <bar>
        <type foobar="7"/>
        <type foobar="3"/>
        <type foobar="21"/>
        <type foobar="16"/>
        <type foobar="11"/>
    </bar>
</foo>

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.

import declxml as xml

class Bar:

    def __init__(self):
        self.foobars = []

    def __repr__(self):
        return 'Bar(foobars={})'.format(self.foobars)


xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.user_object('bar', Bar, [
        xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the following output

{'bar': Bar(foobars=[1, 2])}
You need to login account before you can post.

About| Privacy statement| Terms of Service| Advertising| Contact us| Help| Sitemap|
Processed in 0.313143 second(s) , Gzip On .

© 2016 Powered by mzan.com design MATCHINFO