Friday, March 7, 2008

PHP and SimpleXML namespaces

Over the last few days I did some web scraping with PHP and decided to use SimpleXML's XPath feature to get it done. But the XPath stuff caused a massive headache. The problem: I wasn't able to execute any XPath expression on the SimpleXML document. Not a single one! Because I had no idea what the problem was, I googled a little bit and found some hints. The XPath engine needs to know about namespaces (registerXPathNamespace to the rescue), and an unprefixed name in an XPath expression never matches an element that lives in a default namespace.
So what went wrong? Being an orderly guy, I had used tidy to make the possibly not well-formed XHTML document parseable for SimpleXML. Of course I had exported XHTML with tidy, and pedantic tidy added the XHTML default namespace in this case.
The obvious solution was simply to run tidy with the output-html option instead of output-xhtml.
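
For the record, here is a minimal sketch of the namespace problem and the registerXPathNamespace way out. The sample document and the prefix "x" are made up for illustration; only the namespace URI is the real XHTML one.

```php
<?php
// Made-up sample document with a default XHTML namespace, similar to tidy's XHTML output.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hello</p></body></html>';
$page = simplexml_load_string($xhtml);

// Unprefixed names only match elements in no namespace, so this finds nothing.
$plain = $page->xpath('//p');

// Bind an arbitrary prefix ("x" is my own choice) to the namespace and use it in the query.
$page->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');
$prefixed = $page->xpath('//x:p');

echo count($plain), ' ', count($prefixed), "\n";
```

The prefix only exists inside the XPath expression, so it does not have to match anything in the document itself.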

I probably would never have run into this problem if I had used this little workaround with DOMDocument::loadHTML:

// loadHTML() is an instance method; calling it statically fails on current PHP
$dom = new DOMDocument();
$dom->loadHTML($html);
$page = simplexml_import_dom($dom);

I guess the DOM function will fix most common issues with broken HTML.
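
A slightly fuller sketch of the same workaround, run against some deliberately broken markup. The input string is made up, and the libxml_use_internal_errors call is my own addition to keep parser warnings quiet.

```php
<?php
// Made-up broken markup: unclosed <p> and <b> tags, no <html> or <body>.
$html = '<p>First<p>Second <b>bold';

$dom = new DOMDocument();
// libxml may complain about sloppy markup; collect the errors instead of printing warnings.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$page = simplexml_import_dom($dom);

// The imported tree carries no namespace, so plain unprefixed XPath works.
$paragraphs = $page->xpath('//p');
echo count($paragraphs), "\n";
```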
