Sunday, November 2, 2014

Parse XML in Ruby with the Nokogiri gem!

The Ruby gem 'Nokogiri' is the de facto standard for parsing XML and HTML in the Ruby community.
The official Nokogiri web page explains how to install it on various platforms.
An XML file like the catalog.xml:



  
  
  
  
  

can be loaded and parsed like:
require 'nokogiri'
file_path = '/usr/share/xml/docbook/schema/dtd/catalog.xml'
xml = Nokogiri::XML(File.read(file_path))
=> #<Nokogiri::XML::Document:0xfda908 name="document"...>
Several XML/HTML parse options are also available, for example one that eliminates all blank nodes:
xml = Nokogiri::XML(File.read(file_path), &:noblanks)
=> #<Nokogiri::XML::Document:0xfda908 name="document">
The returned Nokogiri object provides a lot of ways to access the XML nodes.

Basic Ruby API

The object can be traversed with plain Ruby like:
xml.children.children.select { |node| 
  node.name.eql? 'system' 
}.map{ |node| 
  node.attributes["systemId"].value 
}
=> ["http://oasis-open.org/docbook/xml/4.1/docbookx.dtd", 
"http://docbook.org/xml/4.1/docbookx.dtd"]
which is often cumbersome and verbose. That is why there are other APIs.

Basic XPath API

One of them is XPath:
xml.xpath("//xmlns:catalog/xmlns:system/@systemId")
=> [#<Nokogiri::XML::Attr:0xf4ea20 name="systemId" value="http://oasis-open.org/docbook/xml/4.1/docbookx.dtd">, 
#<Nokogiri::XML::Attr:0xf4e110 name="systemId" value="http://docbook.org/xml/4.1/docbookx.dtd">]
which has to be combined with Ruby to get the plain values:
xml.xpath("//xmlns:catalog/xmlns:system/@systemId").
  map(&:value)
=> ["http://oasis-open.org/docbook/xml/4.1/docbookx.dtd", 
"http://docbook.org/xml/4.1/docbookx.dtd"]
Dealing with namespaces is annoying in many use cases, so they can be removed:
xml.remove_namespaces!.
  xpath("//catalog/system/@systemId").map(&:value)
=> ["http://oasis-open.org/docbook/xml/4.1/docbookx.dtd", 
"http://docbook.org/xml/4.1/docbookx.dtd"]
The entire XPath reference is documented at W3Schools.

Basic CSS API

The third way of accessing the XML node tree is CSS selectors. The full CSS selector reference is also documented at W3Schools; selecting XML nodes works like:
xml.css("catalog > system")
=> [#<Nokogiri::XML::Element:0xf4ea84 name="system" attributes=[
  #<Nokogiri::XML::Attr:0xf4ea20 name="systemId" value="http://www.oasis-open.org/docbook/xml/4.1/docbookx.dtd">
]>, 
#<Nokogiri::XML::Element:0xf4e174 name="system" attributes=[
  #<Nokogiri::XML::Attr:0xf4e110 name="systemId" value="http://docbook.org/xml/4.1/docbookx.dtd">
]>]
With the CSS API only the nodes themselves can be selected. When their attributes are needed, the Ruby API comes into play:
xml.css("catalog > system").map{ |node| 
  node.attributes['systemId'].value 
}
=> ["http://oasis-open.org/docbook/xml/4.1/docbookx.dtd", 
"http://docbook.org/xml/4.1/docbookx.dtd"]


Nokogiri can handle HTML documents as well.
Fetching the Nokogiri home page with curl (read Tell shell scripting apart in Ruby!) and finding the text of the first tutorials link is as easy as:
html = Nokogiri::HTML(%x[curl http://www.nokogiri.org/])
html.css('.page ol li a').first.text

Supported by Ruby 2.1.1