Wednesday, 7 December 2016

36 C# Using XPath

Using XPath with the XmlDocument class

In a previous chapter, we used the XmlDocument class to extract information from an XML file. We did it with a series of calls to the ChildNodes property, which was manageable only because the example was very simple. It didn't do much for the readability of our code, though, so in this chapter we will look at another approach, which is more powerful and yet easier to read and maintain. The technology we will be using is called XPath, and it is maintained by the W3C, the same organization that created the XML standard. XPath is an entire query language with lots of possibilities, but since this is not an XPath tutorial, we will only look at some basic queries. Even in its simplest forms, XPath is powerful, as you will see in the following examples.

The XmlDocument class has several methods that take an XPath query as a parameter and return the resulting XmlNode(s). In this chapter we will look at two of them: the SelectSingleNode() method, which returns the first XmlNode matching the provided XPath query (or null if nothing matches), and the SelectNodes() method, which returns an XmlNodeList collection of all the XmlNode objects matching the query.

We will try both of the above-mentioned methods, but instead of using the currency information XML we tested in previous chapters, we will now try a new XML source. RSS feeds are essentially XML documents built in a specific way, so that many different news readers can parse and show the same information in their own way.

We will use an RSS feed from CNN, located at http://rss.cnn.com/rss/edition_world.rss, with news from across the world. If you open it in your browser, the browser may render it in a nice way, allowing you to get an overview of the feed and subscribe to it, but don't be fooled by that: under the hood, it's just XML, which you will see if you do a "View source" in your browser. You will see that the root element is called "rss". The rss element contains a "channel" element, and within this element we can find information about the feed as well as the "item" nodes, which are the news items we usually want.

In the following example, we will use the SelectSingleNode() method to get the title of the feed. If you look at the XML, you will see that there is a <title> element as a child element of the <channel> element, which is then a child element of the <rss> element, the root. That query can be described like this in XPath: 

//rss/channel/title 

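We simply write the names of the elements we're looking for, separated by forward slashes (/); each slash states that the element after it should be a child of the element before it. The leading double slash (//) tells XPath to look for the first element, rss, anywhere in the document; since rss is the root element here, /rss/channel/title would match the same node. Using this XPath is as simple as this: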
using System;
using System.Xml;

namespace ParsingXml
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the RSS feed.
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.Load("http://rss.cnn.com/rss/edition_world.rss");

            // Select the <title> child of <channel>; the result is null if nothing matches.
            XmlNode titleNode = xmlDoc.SelectSingleNode("//rss/channel/title");
            if(titleNode != null)
                Console.WriteLine(titleNode.InnerText);

            // Keep the console window open until a key is pressed.
            Console.ReadKey();
        }
    }
}
We use the SelectSingleNode() method, which simply takes our XPath as a string parameter, to locate the <title> element. We then check that it returned a result, and if it did, we print the InnerText of the located node, which should be the title of the RSS feed.
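
If you want to try SelectSingleNode() without depending on the live feed, one option is to parse a small feed from a string with LoadXml() instead of Load(). The following is only a sketch: the SampleFeed content, the FeedTitleDemo class and the GetText() helper are made up for illustration and are not part of the CNN feed or the .NET framework. It also shows what happens when the query matches nothing.

```csharp
using System;
using System.Xml;

namespace ParsingXml
{
    public static class FeedTitleDemo
    {
        // Hypothetical, hand-written feed used only for illustration.
        public const string SampleFeed =
            "<rss version=\"2.0\"><channel>" +
            "<title>Example World News</title>" +
            "<item><title>First story</title></item>" +
            "</channel></rss>";

        // Returns the InnerText of the first node matching the XPath query,
        // or null when the query matches nothing.
        public static string GetText(string xml, string xpath)
        {
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(xml); // parse from a string instead of a URL
            XmlNode node = xmlDoc.SelectSingleNode(xpath);
            return node != null ? node.InnerText : null;
        }

        public static void Main(string[] args)
        {
            // Prints "Example World News".
            Console.WriteLine(GetText(SampleFeed, "//rss/channel/title"));

            // The sample has no <description> element, so SelectSingleNode()
            // returns null and we print "(no match)" instead.
            Console.WriteLine(GetText(SampleFeed, "//rss/channel/description") ?? "(no match)");
        }
    }
}
```

The null check matters because SelectSingleNode() does not throw when nothing matches; reading InnerText on the null result would.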

In the next example, we will use the SelectNodes() method to find all the item nodes in the RSS feed and then print out information about them:
using System;
using System.Xml;

namespace ParsingXml
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the RSS feed.
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.Load("http://rss.cnn.com/rss/edition_world.rss");

            // Select every <item> element inside <channel>.
            XmlNodeList itemNodes = xmlDoc.SelectNodes("//rss/channel/item");
            foreach(XmlNode itemNode in itemNodes)
            {
                // These queries are relative to the current <item> node.
                XmlNode titleNode = itemNode.SelectSingleNode("title");
                XmlNode dateNode = itemNode.SelectSingleNode("pubDate");

                // Skip items that lack either element.
                if((titleNode != null) && (dateNode != null))
                    Console.WriteLine(dateNode.InnerText + ": " + titleNode.InnerText);
            }

            // Keep the console window open until a key is pressed.
            Console.ReadKey();
        }
    }
}
The SelectNodes() method takes an XPath query as a string, just like we saw in the previous example, and returns the matching XmlNode objects in an XmlNodeList collection. We iterate through it with a foreach loop, and for each of the item nodes, we ask for the child nodes called title and pubDate (published date) by calling SelectSingleNode() directly on the item node. If we get both of them, we print out the date and the title on the same line and then move on.
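
To see the effect of the null check, here is a minimal sketch that runs the same loop against a small feed held in a string. The feed content, the FeedItemsDemo class and the DescribeItems() helper are hypothetical and exist only for this illustration; the second item deliberately has no pubDate element.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

namespace ParsingXml
{
    public static class FeedItemsDemo
    {
        // Hypothetical feed; the second item deliberately lacks a <pubDate>.
        public const string SampleFeed =
            "<rss version=\"2.0\"><channel><title>Example World News</title>" +
            "<item><title>First story</title><pubDate>Wed, 07 Dec 2016 10:00:00 GMT</pubDate></item>" +
            "<item><title>Undated story</title></item>" +
            "<item><title>Third story</title><pubDate>Wed, 07 Dec 2016 12:30:00 GMT</pubDate></item>" +
            "</channel></rss>";

        // Builds one "date: title" line per item that has both elements.
        public static List<string> DescribeItems(string xml)
        {
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(xml); // parse from a string instead of a URL
            List<string> lines = new List<string>();
            foreach (XmlNode itemNode in xmlDoc.SelectNodes("//rss/channel/item"))
            {
                // Relative queries: evaluated against the current <item>.
                XmlNode titleNode = itemNode.SelectSingleNode("title");
                XmlNode dateNode = itemNode.SelectSingleNode("pubDate");
                if (titleNode != null && dateNode != null)
                    lines.Add(dateNode.InnerText + ": " + titleNode.InnerText);
            }
            return lines;
        }

        public static void Main(string[] args)
        {
            // Prints two lines; "Undated story" is skipped.
            foreach (string line in DescribeItems(SampleFeed))
                Console.WriteLine(line);
        }
    }
}
```

Returning the lines as a list rather than printing inside the loop is just a design choice for this sketch; it keeps the query logic separate from the console output.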

In our example, we wanted two different values from each item node, which is why we asked for the item nodes and then processed each of them. However, if we only needed, for example, the title of each item, we could change the XPath query to something like this:

//rss/channel/item/title 

It will match each title node in each of the item nodes. Here's the query with some C# code to make it all happen:
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("http://rss.cnn.com/rss/edition_world.rss");
// Select the <title> of every <item> directly, skipping the item nodes themselves.
XmlNodeList titleNodes = xmlDoc.SelectNodes("//rss/channel/item/title");
foreach(XmlNode titleNode in titleNodes)
    Console.WriteLine(titleNode.InnerText);
Console.ReadKey();
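
Note that the full path matters here: the channel has a title element of its own, and a shorter query such as //title would match that one too. The sketch below counts the matches for both queries against a small hand-written feed; the feed content, the TitleQueryDemo class and the CountMatches() helper are assumptions made for this illustration only.

```csharp
using System;
using System.Xml;

namespace ParsingXml
{
    public static class TitleQueryDemo
    {
        // Hypothetical feed: one channel title and two item titles.
        public const string SampleFeed =
            "<rss version=\"2.0\"><channel>" +
            "<title>Example World News</title>" +
            "<item><title>First story</title></item>" +
            "<item><title>Second story</title></item>" +
            "</channel></rss>";

        // Returns how many nodes the XPath query matches in the given XML.
        public static int CountMatches(string xml, string xpath)
        {
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(xml); // parse from a string instead of a URL
            return xmlDoc.SelectNodes(xpath).Count;
        }

        public static void Main(string[] args)
        {
            // Only the titles that are children of <item>: prints 2.
            Console.WriteLine(CountMatches(SampleFeed, "//rss/channel/item/title"));

            // Every <title> anywhere in the document, including the
            // channel's own title: prints 3.
            Console.WriteLine(CountMatches(SampleFeed, "//title"));
        }
    }
}
```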
