Parsing XML in Python. SAX XML and DOM XML Parsing Using Python

Parsing XML in Python. SAX XML and DOM XML Parsing Using Python

Parsing XML using Python. Will use the SAX ( Simple API for XML) parser and DOM ( Document Object Model ) parser

Table of contents

XML

Simply speaking XML (Extensible Markup Language) is a markup language similar to HTML, but without predefined tags to use. This is a powerful way to store data and structure data. Although, JSON is the new adapted standard for transmitting data in webapis, Not long ago, XML was a famous choice.

We will see how you can accept data from a xml file and parse the content using python.

SAX Parser

SAX (Simple API for XML) is an event-based parser for XML documents. Unlike a DOM parser, a SAX parser creates no parse tree. AX parsers read XML node by node, issuing parsing events while making a step through the input stream.

Let's take a sample XML file.

<?xml version="1.0"?>  
<class>  
    <student>  
        <id>101</id>  
        <name>John</name>
        <department>Engineering </department>
        <subject>Mathematics</subject>  
        <marks>83</marks>
        <promoted>yes</promoted>
    </student>  

    <student>  
        <id>102</id>  
        <name>Kapil</name>
        <department>Engineering </department>
        <subject>Chemistry</subject>  
        <marks>60</marks>
        <promoted>yes</promoted>
    </student>  

    <student>  
        <id>103</id>  
        <name>Harsh</name>  
        <department>Engineering </department>
        <subject>English</subject>  
        <marks>70</marks>
        <promoted>yes</promoted>
    </student>  

    <student>  
        <id>104</id>  
        <name>Jite</name>  
        <department>Engineering </department>
        <subject>Physics</subject>  
        <marks>76</marks>
        <promoted>yes</promoted>
    </student>  

</class>
  • We create a python file and import xml to parse data.
import xml.sax
  • For parsing XML with SAX, we need to create a class to override the default contextHandler, which has certain methods that get called on certain events like elementStart, elementEnd.
if __name__ == "__main__":
    # create an XMLReader
    parser = xml.sax.make_parser()
    # a class that override the default ContextHandler
    Handler = StudentHandler()
    parser.setContentHandler(Handler)
    # make sure your file name mathces here
    # it should also be in the same location
    parser.parse("XMLFile.xml")

Here we have our StudentHandler class which overrides certain methods whose names are pretty self explanatory.

class StudentHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentTag = ""
        self.id = ""
        self.name = ""
        self.department = ""
        self.subject = ""
        self.marks = ""
        self.promoted = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentTag = tag
        if (tag == "student"):
            print("**** New Student Data ****")

    # Called when an elements ends
    def endElement(self, tag):
        if self.CurrentTag == "id":
            print("ID : ", self.id)
        elif self.CurrentTag == "name":
            print("Name :", self.name)
        elif self.CurrentTag == "department":
            print("Department :", self.department)
        elif self.CurrentTag == "subject":
            print("Subject:", self.subject)
        elif self.CurrentTag == "marks":
            print("Marks:", self.marks)
        elif self.CurrentTag == "promoted":
            print("Promoted :", self.promoted)
        self.CurrentTag = ""


    # Called when a character is read
    def characters(self, content):
        if self.CurrentTag == "id":
            self.id = content
        elif self.CurrentTag == "name":
            self.name = content
        elif self.CurrentTag == "department":
            self.department = content
        elif self.CurrentTag == "subject":
            self.subject = content
        elif self.CurrentTag == "marks":
            self.marks = content
        elif self.CurrentTag == "promoted":
            self.promoted = content
  • Run this and this gives something as this as an output.
**** New Student Data ****
ID :  101
Name : John
Department : Engineering
Subject: Mathematics
Marks: 83
Promoted : yes
**** New Student Data ****
ID :  102
Name : Kapil
Department : Engineering
Subject: Chemistry
Marks: 60
Promoted : yes
**** New Student Data ****
ID :  103
Name : Harsh
Department : Engineering
Subject: English
Marks: 70
Promoted : yes
**** New Student Data ****
ID :  104
Name : Jite
Department : Engineering
Subject: Physics
Marks: 76
Promoted : yes

DOM Parser

We import the minidom from xml, and load the entire XML file, parse it and store it in doc variable. Since we have a list of students, we get all the students and iterate over it. After that , it's just some fancy code to print that in tabular format.

from xml.dom import minidom

doc = minidom.parse("XMLFile.xml")

students = doc.getElementsByTagName("student")

print("\t\t*****    User Data    ****")
print("\t Id \t Name \t Department \t Subject \t Marks \t Promoted")
for student in students:
    sid = student.getElementsByTagName("id")[0]
    # print(sid)
    nickname = student.getElementsByTagName("name")[0]
    lastNAme = student.getElementsByTagName("department")[0]
    subject = student.getElementsByTagName("subject")[0]
    marks = student.getElementsByTagName("marks")[0]
    promoted = student.getElementsByTagName("promoted")[0]
    print("\t %s \t %s \t %s \t %s \t %s \t %s" % (
        sid.firstChild.data,
        nickname.firstChild.data,
        lastNAme.firstChild.data,
        subject.firstChild.data,
        marks.firstChild.data,
        promoted.firstChild.data,
    ))
                *****    User Data    ****
         Id      Name    Department      Subject         Marks   Promoted
         101     John    Engineering     Mathematics     83      yes
         102     Kapil   Engineering     Chemistry       60      yes
         103     Harsh   Engineering     English         70      yes
         104     Jite    Engineering     Physics         76      yes

And that's how we have parsed XML in python , using both SAX and DOM.