Parsing XML in Python. SAX XML and DOM XML Parsing Using Python
Parsing XML using Python. We will use the SAX (Simple API for XML) parser and the DOM (Document Object Model) parser.
XML
Simply speaking, XML (Extensible Markup Language) is a markup language similar to HTML, but without predefined tags. It is a powerful way to store and structure data. Although JSON is now the more widely adopted standard for transmitting data in web APIs, XML was a popular choice not long ago.
We will see how to read data from an XML file and parse its content using Python.
SAX Parser
SAX (Simple API for XML) is an event-based parser for XML documents. Unlike a DOM parser, a SAX parser creates no parse tree. SAX parsers read XML node by node, issuing parsing events as they step through the input stream.
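To make the event flow concrete, here is a minimal sketch that records the events a SAX parser issues for a tiny document. The EventLogger class and its events list are hypothetical names used only for illustration; startElement, endElement, characters and xml.sax.parseString come from the standard library.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records each SAX event instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only text between elements.
        if content.strip():
            self.events.append(("text", content))


handler = EventLogger()
# The document is passed as bytes; no tree is kept, only the events we log.
xml.sax.parseString(b"<student><id>101</id></student>", handler)
for event in handler.events:
    print(event)
```

Each tuple corresponds to one callback, in document order: the parser never holds more than the current node.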
Let's take a sample XML file.
<?xml version="1.0"?>
<class>
    <student>
        <id>101</id>
        <name>John</name>
        <department>Engineering </department>
        <subject>Mathematics</subject>
        <marks>83</marks>
        <promoted>yes</promoted>
    </student>
    <student>
        <id>102</id>
        <name>Kapil</name>
        <department>Engineering </department>
        <subject>Chemistry</subject>
        <marks>60</marks>
        <promoted>yes</promoted>
    </student>
    <student>
        <id>103</id>
        <name>Harsh</name>
        <department>Engineering </department>
        <subject>English</subject>
        <marks>70</marks>
        <promoted>yes</promoted>
    </student>
    <student>
        <id>104</id>
        <name>Jite</name>
        <department>Engineering </department>
        <subject>Physics</subject>
        <marks>76</marks>
        <promoted>yes</promoted>
    </student>
</class>
- We create a Python file and import the xml.sax module to parse the data.
import xml.sax
- To parse XML with SAX, we need to create a class that overrides the default ContentHandler, which has methods that get called on certain events, such as startElement and endElement.
if __name__ == "__main__":
    # create an XMLReader
    parser = xml.sax.make_parser()
    # a class that overrides the default ContentHandler
    handler = StudentHandler()
    parser.setContentHandler(handler)
    # make sure your file name matches here
    # it should also be in the same location
    parser.parse("XMLFile.xml")
Here we have our StudentHandler class, which overrides certain methods whose names are pretty self-explanatory. It has to be defined above the main block in the same file.
class StudentHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentTag = ""
        self.id = ""
        self.name = ""
        self.department = ""
        self.subject = ""
        self.marks = ""
        self.promoted = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentTag = tag
        if tag == "student":
            print("**** New Student Data ****")

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentTag == "id":
            print("ID :", self.id)
        elif self.CurrentTag == "name":
            print("Name :", self.name)
        elif self.CurrentTag == "department":
            print("Department :", self.department)
        elif self.CurrentTag == "subject":
            print("Subject:", self.subject)
        elif self.CurrentTag == "marks":
            print("Marks:", self.marks)
        elif self.CurrentTag == "promoted":
            print("Promoted :", self.promoted)
        self.CurrentTag = ""

    # Called when character data is read. The parser may deliver the text of
    # one element in several chunks; this simple version keeps only the last one.
    def characters(self, content):
        if self.CurrentTag == "id":
            self.id = content
        elif self.CurrentTag == "name":
            self.name = content
        elif self.CurrentTag == "department":
            self.department = content
        elif self.CurrentTag == "subject":
            self.subject = content
        elif self.CurrentTag == "marks":
            self.marks = content
        elif self.CurrentTag == "promoted":
            self.promoted = content
- Running this gives output like the following.
**** New Student Data ****
ID : 101
Name : John
Department : Engineering
Subject: Mathematics
Marks: 83
Promoted : yes
**** New Student Data ****
ID : 102
Name : Kapil
Department : Engineering
Subject: Chemistry
Marks: 60
Promoted : yes
**** New Student Data ****
ID : 103
Name : Harsh
Department : Engineering
Subject: English
Marks: 70
Promoted : yes
**** New Student Data ****
ID : 104
Name : Jite
Department : Engineering
Subject: Physics
Marks: 76
Promoted : yes
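One caveat with the handler above: a SAX parser is allowed to call characters more than once for the text of a single element, so assigning the content directly can lose part of a value. Below is a minimal sketch of a variant that buffers the text and collects each student as a dict instead of printing it. StudentCollector, FIELDS and the shortened inline sample are assumptions made for illustration, not part of the original script.

```python
import xml.sax

# Leaf tags we want to capture for each student (taken from the sample file).
FIELDS = {"id", "name", "department", "subject", "marks", "promoted"}


class StudentCollector(xml.sax.ContentHandler):
    """Hypothetical variant of StudentHandler that stores records instead of printing."""

    def __init__(self):
        super().__init__()
        self.students = []   # finished records, one dict per <student>
        self._record = None  # record currently being filled in
        self._buffer = []    # text chunks of the current element

    def startElement(self, tag, attributes):
        if tag == "student":
            self._record = {}
        self._buffer = []

    def characters(self, content):
        # May be called several times per element, so append rather than assign.
        self._buffer.append(content)

    def endElement(self, tag):
        if tag in FIELDS and self._record is not None:
            self._record[tag] = "".join(self._buffer).strip()
        elif tag == "student":
            self.students.append(self._record)
            self._record = None
        self._buffer = []


# Shortened inline sample so the sketch runs without XMLFile.xml.
sample = b"""<class>
<student><id>101</id><name>John</name><department>Engineering </department></student>
<student><id>102</id><name>Kapil</name><department>Engineering </department></student>
</class>"""

collector = StudentCollector()
xml.sax.parseString(sample, collector)
print(collector.students)
```

Keeping the records in a list separates parsing from presentation, and the strip call removes the trailing space in the department values.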
DOM Parser
We import minidom from xml.dom, load and parse the entire XML file, and store the result in the doc variable. Since we have a list of students, we get all the student elements and iterate over them. After that, it's just some formatting code to print the data in tabular format.
from xml.dom import minidom

# Load and parse the whole file into an in-memory DOM tree
doc = minidom.parse("XMLFile.xml")
students = doc.getElementsByTagName("student")
print("\t\t***** User Data ****")
print("\t Id \t Name \t Department \t Subject \t Marks \t Promoted")
for student in students:
    # Each lookup returns a list of matches; we take the first one
    sid = student.getElementsByTagName("id")[0]
    name = student.getElementsByTagName("name")[0]
    department = student.getElementsByTagName("department")[0]
    subject = student.getElementsByTagName("subject")[0]
    marks = student.getElementsByTagName("marks")[0]
    promoted = student.getElementsByTagName("promoted")[0]
    # firstChild is the text node inside the element
    print("\t %s \t %s \t %s \t %s \t %s \t %s" % (
        sid.firstChild.data,
        name.firstChild.data,
        department.firstChild.data,
        subject.firstChild.data,
        marks.firstChild.data,
        promoted.firstChild.data,
    ))
***** User Data ****
Id Name Department Subject Marks Promoted
101 John Engineering Mathematics 83 yes
102 Kapil Engineering Chemistry 60 yes
103 Harsh Engineering English 70 yes
104 Jite Engineering Physics 76 yes
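The DOM code above assumes every student has all six child elements and that none of them is empty; a missing tag raises an IndexError and an empty one has no firstChild. A minimal sketch of a more forgiving lookup follows. The child_text helper, its default parameter and the inline sample are hypothetical names introduced here for illustration.

```python
from xml.dom import minidom


def child_text(element, tag, default=""):
    """Return the stripped text of the first <tag> descendant, or default if absent or empty."""
    nodes = element.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.data.strip()


# Inline sample: the second student has empty marks, and neither has <promoted>.
sample = """<class>
<student><id>101</id><name>John</name><marks>83</marks></student>
<student><id>102</id><name>Kapil</name><marks></marks></student>
</class>"""

doc = minidom.parseString(sample)
rows = [
    {tag: child_text(student, tag, default="n/a")
     for tag in ("id", "name", "marks", "promoted")}
    for student in doc.getElementsByTagName("student")
]
print(rows)
```

Whether a placeholder such as "n/a" is appropriate depends on the data; raising an error may be the better choice when every field is required.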
And that's how we parse XML in Python, using both SAX and DOM.