
The LambCutlet Disorganisation

WordPress & “application/xhtml+xml” take 2

Posted by Jonathan at 20:00:05 UTC on the 11th of May, 2005

This is a slightly extended version of my earlier WordPress hack, updated to work with the current 1.5.x line. The one major new feature is the ability to force a MIME type via GET (the mediatype query parameter), which some may find handy. The changes are fairly trivial:

In ./wp-blog-header.php, find…


// Sending HTTP headers
	
if ( !empty($error) && '404' == $error ) {
    if ( preg_match('/cgi/', php_sapi_name()) )
        @header('Status: 404 Not Found');
    else
        @header('HTTP/1.x 404 Not Found');
} else if ( empty($feed) ) {
    @header('X-Pingback: '. get_bloginfo('pingback_url'));
    @header('Content-type: ' . get_bloginfo('html_type') . '; charset=' . get_option('blog_charset'));

Replace with…


// Default to plain HTML so that $media_type_get is always defined before the switch
$media_type_get = 'html';

if(     (strstr($_SERVER['HTTP_ACCEPT'], 'application/xhtml+xml')) ||
    (strstr($_SERVER['HTTP_USER_AGENT'], 'W3C_Validator')) ||
    (strstr($_SERVER['HTTP_USER_AGENT'], 'WDG_Validator')) )    {
    //
    // If any of the conditions match and a media type via GET is not set, set the appropriate media type
    $media_type_get = (!isset($_GET['mediatype'])) ? 'xhtml' : $_GET['mediatype'];
}
	
switch ($media_type_get) {
    //
    // For our lovely XHTML
    case 'xhtml':
        $http_media_type = 'application/xhtml+xml';
        break;
    //
    // For generic XML, or fancy mixed mode XML documents
    case 'xml':
    case 'axml':
    case 'xsl':
    case 'xslt':
        $http_media_type = 'application/xml';
        break;
    //
    // This is deprecated and has Unicode handling problems but what the hell
    case 'txml':
        $http_media_type = 'text/xml';
        break;
    //
    // The proposed MIME type for the XSLT2 spec
    case 'xslt2':
        $http_media_type = 'application/xslt+xml';
        break;
    //
    // This is funny for jokes :D
    case 'text':
        $http_media_type = 'text/plain';
        break;
    //
    // Boring old HTML for stupid browsers
    case 'html':
    default:
        $http_media_type = 'text/html';
        break;
}
	
// Sending HTTP headers
	
if ( !empty($error) && '404' == $error ) {
    if ( preg_match('/cgi/', php_sapi_name()) )
        @header('Status: 404 Not Found');
    else
        @header('HTTP/1.x 404 Not Found');
} else if ( empty($feed) ) {
    @header('X-Pingback: '. get_bloginfo('pingback_url'));
    @header('Content-type: ' . get_bloginfo('http_media_type') . '; charset=' . get_option('blog_charset'));

Then in ./wp-includes/template-functions-general.php, find…


    case 'html_type' :
        $output = get_option('html_type');
        break;
    case 'version':
        global $wp_version;
        $output = $wp_version;
        break;

Replace with…


    case 'html_type' :
        $output = get_option('html_type');
        break;
    case 'http_media_type':
        global $http_media_type;
        $output = $http_media_type;
        break;
    case 'version':
        global $wp_version;
        $output = $wp_version;
        break;

Lastly in ./wp-content/themes/your-template/header.php, find…


<meta http-equiv="Content-Type" content="<?php bloginfo('html_type'); ?>; charset=<?php bloginfo('charset'); ?>" />

Replace with…


<meta http-equiv="Content-Type" content="<?php bloginfo('http_media_type'); ?>; charset=<?php bloginfo('charset'); ?>" />

Depending on your theme, you may need to fix up the CSS, as <html>, being the root element and all, is now the canvas!

I’ve prepared a patch for WordPress 1.5.x to enable application/xhtml+xml and some more. It applies all the above changes and also fixes Kubrick, whose default CSS doesn’t quite work in “Draconian Mode™”. You can apply the patch by downloading it to your WordPress directory and issuing a simple:


patch -u -p1 < wordpress15x-xhtmlmime.patch

Happy patching! :D

Filed under: Software

23 Comments »

  1. I really like the idea of serving XHTML the way it should be served, but when I use your patch Firefox complains about undefined entities. Anything that begins with an ampersand and ends with a semicolon seems to be a no-go, and it is not just in the theme. What did you do to fix this?

    Comment by Tjaard at 18:19:36 UTC on the 13th of May, 2005

  2. The thing with XML, of which XHTML is an application, is that named entities must be defined. The only ones which XML has by default are amp, lt, gt, apos & quot. When an XHTML browser sees XHTML, the DTD also defines all commonly used HTML entities such as nbsp, raquo and so on. To be honest I’ve never come across the issue with Firefox you mention (that includes really old versions before Firefox was called Firefox), and I’ve been sending XHTML with the XHTML MIME type for over 2 years now (with a small break when I first moved to WordPress).

    The only thing which may catch you out is if you include content in your blog from another source which may not encode characters at all, thus letting slip the odd ‘"’ or ‘<’… things which are quite easily solved by running the htmlspecialchars() function on offending strings in PHP.

    Comment by Jonathan Stanley at 18:47:56 UTC on the 13th of May, 2005
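
As a rough illustration of the htmlspecialchars() suggestion above, here is a minimal sketch. The helper name escape_for_xhtml() is hypothetical and not part of WordPress or the patch, and the UTF-8 default is an assumption about the blog's charset.

```php
<?php
// Hypothetical helper (not part of WordPress): escape text pulled in from
// an external source before echoing it into an XHTML template.
function escape_for_xhtml($text, $charset = 'UTF-8')
{
    // ENT_QUOTES also converts single quotes; the final argument (false)
    // stops entities that are already encoded from being encoded twice.
    return htmlspecialchars($text, ENT_QUOTES, $charset, false);
}

// Prints: Fish &amp; &quot;chips&quot; &lt;today&gt;
echo escape_for_xhtml('Fish & "chips" <today>'), "\n";
```

Passing false as the last argument is a design choice for feeds that are sometimes pre-escaped and sometimes not; drop it if the source is known to be raw text.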

  3. The first thing that I encountered is that an xml declaration apparently wasn’t allowed on the first line of the document. After that I got an xml parse error; indeed at the first occurrence of a &nbsp; . I’ll try it again using the default theme (definitely should work, right :) ? ) and otherwise I’ll see what might be wrong with the server settings…

    I’ll keep you posted :) .

    Comment by Tjaard at 19:15:20 UTC on the 13th of May, 2005

  4. Okay, it seems to work now… I just need some way to verify whether the pages are really served using the correct MIME type; wget isn’t much help :) . Anyway, thanks for the patch, you got credited for it in my blog’s footer :) .

    Comment by Tjaard at 19:40:34 UTC on the 13th of May, 2005
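
One way to check which MIME type is really being served is to look at the Content-Type response header. The sketch below is only an illustration: content_type_from_headers() is a hypothetical helper, the URL in the usage note is a placeholder, and passing a stream context to get_headers() assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: pull the media type out of a list of raw response
// header lines, such as the array returned by PHP's get_headers().
function content_type_from_headers(array $header_lines)
{
    $found = null;
    foreach ($header_lines as $line) {
        // Header names are case-insensitive. If redirects produced several
        // Content-Type lines, the last one (the final response) wins.
        if (preg_match('/^Content-Type:\s*([^;\s]+)/i', $line, $m)) {
            $found = strtolower($m[1]);
        }
    }
    return $found;
}

// Usage sketch (needs network access; the URL is a placeholder):
// $context = stream_context_create(array('http' => array(
//     'method' => 'HEAD',
//     'header' => "Accept: application/xhtml+xml\r\n",
// )));
// $lines = get_headers('http://example.org/', false, $context);
// echo content_type_from_headers($lines), "\n";
```

Sending the Accept header matters here, because the patch only switches to application/xhtml+xml when the client advertises it.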

  5. XML documents should always have the XML prologue, though the specification made a special case for XHTML sent as text/html, where it is optional. The issue you are having is actually a PHP configuration issue: you’ve allowed “short tags”, so the following prologue in your template:


    <?xml version="1.0" encoding="UTF-8"?>

    … will actually get processed as PHP, which breaks things. ;)

    The proper fix is to turn short tags off (it only encourages even sloppier coding in my opinion) or use echo:


    <?php echo '<?xml version="1.0" encoding="UTF-8"?>'."\n"; ?>

    The default WordPress does work, though various plugins you might have enabled may not be so kind. There are still a few bugs (perhaps intended features) in WordPress where some special cases will cause XML parsing havoc, though such cases wouldn’t have been valid XHTML output anyway. One arse-saving trick would be to use libtidy (assuming you’ve got it compiled into your PHP install) and output-buffer WordPress’s output, which will then always be well-formed and valid XHTML! :D

    Comment by Jonathan Stanley at 19:42:36 UTC on the 13th of May, 2005

  6. It just occurred to me… in order to get a stats javascript thingie working I had to put the script in a CDATA section (like <!--//--><![CDATA[//><!--). Now I’m just wondering… on a typical WordPress main page all posts have a commented metadata field (with rdf stuff), shouldn’t those be in a CDATA section as well?

    Comment by Tjaard at 13:07:27 UTC on the 16th of May, 2005

  7. Actually, all the <script> tricks are mutually exclusive, depending on whether the document is served as text/html or application/xhtml+xml. The best way to handle it in WordPress is to treat it like serving the XML declaration: do one thing if it’s in an XML mode, something else if not… which is actually just what my Google AdSense Injector plugin does, though admittedly limited to only that application.

    Guess that means I should really write another plugin to allow conditional template content to be inserted into one’s templates… I’ll post a new post once I’ve tested that what I currently think should work, actually works! :P

    Comment by Jonathan Stanley at 14:14:20 UTC on the 16th of May, 2005

  8. Hmm, IE and Konqueror show me a ]]> where the CDATA sections end… The silly thing is that <!--//--><![CDATA[//><!-- /*some js*/ //--><!]]> works for javascript on both IE and other browsers. Perhaps I’ll write a plugin myself, based on yours, which just has a function that returns the correct delimiters depending on the browser; with your permission of course ;) .

    Comment by Tjaard at 15:33:55 UTC on the 16th of May, 2005
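
The "function that returns the correct delimiters" idea could look something like the sketch below. It keys off the media type being served rather than the browser, which is an assumption on my part, and script_delimiters() is a hypothetical name, not a function from either plugin mentioned in this thread.

```php
<?php
// Hypothetical helper: choose the wrapper for inline script content
// according to the media type the page is being served as.
function script_delimiters($http_media_type)
{
    $xml_types = array('application/xhtml+xml', 'application/xml', 'text/xml');
    if (in_array($http_media_type, $xml_types, true)) {
        // XML parsers honour CDATA sections; the leading // keeps the
        // delimiters inside JavaScript line comments.
        return array("//<![CDATA[\n", "\n//]]>");
    }
    // In text/html the script content is not parsed as markup, so the old
    // comment-hiding wrapper is used instead.
    return array("<!--\n", "\n//-->");
}

list($open, $close) = script_delimiters('application/xhtml+xml');
echo "<script type=\"text/javascript\">\n", $open, 'var hits = 1;', $close, "\n</script>\n";
```

In a patched install the argument could come from get_bloginfo('http_media_type'), so the template emits one wrapper or the other but never both.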

  9. The mixing of the comment and delimiter is more a breaking of the browser’s parser and getting the result we want… a bit like using HTML tag-soup, only worse.

    You’re welcome to write a plugin, just have to remember plugins have to adhere to GPL (WordPress requirement if memory serves) and the old copyright credit if applicable too… though I’m 90% done with mine, so depends how much of a rush you are in! :P

    Comment by Jonathan Stanley at 15:52:31 UTC on the 16th of May, 2005

  10. Regarding what to do with RDF, all you would do is uncomment it in XML modes and your page will then be a mixed XML document (XHTML + RDF). The issue then is that, though valid & well-formed XML, it will be invalid XHTML 1.0. However, XHTML 1.1 is modular and you could potentially create your own DTD with XHTML 1.1 + RDF, much like what was done for XHTML + MathML + SVG.

    Comment by Jonathan Stanley at 18:07:27 UTC on the 16th of May, 2005

  11. I just found out that this content negotiation thing doesn’t work with KHTML based browsers, such as Safari, though these do support the correct MIME type. You might want to add a ua string match with ‘KHTML’ to the same if statement that sees whether the ua is a validator. I did :) . As far as I know all builds of Konqueror and Safari have ‘KHTML’ in their ua string, so it should be pretty safe. I happened to find out about it reading this wish list.

    But for now, have fun there in Hong Kong :) .

    Comment by Tjaard at 13:50:34 UTC on the 11th of June, 2005

  12. As far as user-agents are concerned, sniffing HTTP_ACCEPT is on balance the only safe way. KHTML should fix what they have there instead. Admittedly you could change it to sniff for KHTML in HTTP_USER_AGENT, as you have, though don’t forget the converse case is true with Opera 7… it does advertise application/xhtml+xml, yet has issues with handling named HTML entities in such a mode. So my little hack is more a start as to how to do things, rather than a kitchen sink which is supposed to take everything into account. :D

    Oh, and people should pay more attention to the dates I have when I’m writing about Hong Kong… I’m retro-blogging. ;)

    Comment by Jonathan Stanley at 17:31:34 UTC on the 11th of June, 2005
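
One refinement to the HTTP_ACCEPT sniffing discussed above: a plain strstr() check also matches a client that lists application/xhtml+xml with q=0, which means "do not send me this". The following is a minimal sketch of a stricter check, not what the patch does; accepts_media_type() is a hypothetical helper.

```php
<?php
// Hypothetical helper: return true only when the Accept header lists the
// given media type with a quality value above zero.
function accepts_media_type($accept_header, $wanted)
{
    foreach (explode(',', $accept_header) as $range) {
        $params = array_map('trim', explode(';', $range));
        $type = strtolower(array_shift($params));
        if ($type !== strtolower($wanted)) {
            continue;
        }
        $q = 1.0; // default quality when no q parameter is given
        foreach ($params as $param) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        return $q > 0;
    }
    // Wildcards such as */* are deliberately not treated as a match.
    return false;
}

var_dump(accepts_media_type('text/html,application/xhtml+xml;q=0.9', 'application/xhtml+xml'));
```

Ignoring wildcards keeps the spirit of the original hack: only clients that explicitly advertise application/xhtml+xml get it.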

  13. I found a bug :) . Or… bug… Firefox doesn’t mind, but quite a few browsers get pissed when they encounter named character entities, which appear to be present in WordPress’ own code. There are 2 occurrences of &raquo; in template-functions-general.php and perhaps some more in other files in wp-includes… you might want to incorporate that in the patch :) .

    I suddenly got complaints from Safari users as soon as I provided them with xhtml+xml as well ;) .

    Comment by Tjaard at 14:22:58 UTC on the 17th of June, 2005
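
For browsers that choke on named entities, one workaround is to rewrite them as numeric character references, which any XML parser can resolve without reading the DTD. This is a minimal sketch with a deliberately short, illustrative map; numeric_entities() is a hypothetical helper and is not part of the patch.

```php
<?php
// Hypothetical helper: rewrite a few common named HTML entities as numeric
// character references. The map is illustrative, not exhaustive.
function numeric_entities($markup)
{
    $map = array(
        '&nbsp;'   => '&#160;',
        '&laquo;'  => '&#171;',
        '&raquo;'  => '&#187;',
        '&copy;'   => '&#169;',
        '&ndash;'  => '&#8211;',
        '&mdash;'  => '&#8212;',
        '&hellip;' => '&#8230;',
    );
    // The five entities XML predefines (amp, lt, gt, apos, quot) are left alone.
    return strtr($markup, $map);
}

// Prints: Blog &#187; Archive&#160;2005
echo numeric_entities('Blog &raquo; Archive&nbsp;2005'), "\n";
```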

  14. That’s why there’s the saying “trying to be too clever”… ;) The same issue affected Opera 7.x, though it was the other way round, in that it sent an HTTP_ACCEPT with application/xhtml+xml when it couldn’t deal with named entities; that has since been rectified in Opera 8.

    Forcing application/xhtml+xml on browsers that may be able to render it, but don’t specifically advertise it, is akin to putting them through a pretty tough edge-case test, as the number of web sites that use the full application/xhtml+xml mode could very well be counted in double digits.

    There’s nothing wrong with using XHTML named entities, just the browser that’s receiving them has to know it’s getting something a little more complicated than vanilla XML. More or less the same issue occurs when trying to render MathML, as that has its own set of named entities.

    Why else is application/xhtml+xml mode called “so strict it barely works”! ;) Anyway, it’s a good way of sorting the men from the boys as far as web browsers are concerned! :D

    Comment by Jonathan Stanley at 14:53:46 UTC on the 17th of June, 2005

  15. […] WordPress is valid XHTML out of the box (well, almost: the comments with alternating colours aren’t). But what is often forgotten is this: XHTML should be served as application/xhtml+xml. In itself it is understandable that this isn’t done by default, because Explorer doesn’t understand it. But a certain Jonathan has made a patch that makes sure browsers which can handle it actually get the right MIME type when WordPress serves XHTML. I’ve used that patch too. You probably won’t notice any difference (if you even know what this is about). But it was little effort to make sure I’m ‘doing it right’; something that sometimes comes in very handy when you’re discussing browsers and web design with fellow students ;) . […]

    Pingback by tjaard.nl » Blog archive » Are you being served…? at 17:00:03 UTC on the 13th of August, 2005

  16. Hmm, seems that a new patch needs to be made for WordPress 2 :) . I understand what your patch is doing, just need to find out where to insert it into WP2’s code. …but of course I won’t mind if someone else finds it out before I do ;) .

    Comment by Tjaard at 19:25:51 UTC on the 3rd of January, 2006

  17. Just had a quick grep through the 2.0 files and it seems the header stuff in ./wp-blog-header.php has now moved to ./wp-includes/classes.php with the function now called send_headers(). As for ./wp-includes/template-functions-general.php that looks much the same so the changes from 1.5 should carry across as is. Ditto with the template files.

    It’s funny how you asked about this just as I was looking at my 2.0 dev site and thinking “I should really get this draconian XHTML thing working again…”.

    I won’t issue a new patch till I’ve tested it myself, though you can try my suggestions and be my tester and let me know if it works! ;)

    Comment by Jonathan Stanley at 19:40:39 UTC on the 3rd of January, 2006

  18. Ah :) . Just one thing when you decide to make a patch: you may want to track down any named entities in WordPress’ code that may show up on the blog (e.g. the » that appears in the title of a single blog entry). This is because browsers like Camino do report themselves as XHTML compliant but can’t handle named entities. I manually fixed this in my wp1.5, in the files I mentioned here.

    Of course it’s you who is to decide about this, I just think of this as a tiny thing that enables users to patch and never have to worry about people not being able to see their blog :) .

    By the way, I’ve seen that the akismet spam filter is now included in WordPress, it works excellently for me… but recent comments, the adsense injector (hey ;) ), browser sniff, google sitemap… at least I’m gonna assess whether they’ll work before I upgrade :) . But I’ll be happy to test the draconian XHTML and adsense injector for you :) .

    Comment by Tjaard at 12:59:15 UTC on the 4th of January, 2006

  19. Issues with XHTML named entities are browser issues so as far as that goes, I’m still saying “won’t fix”. The latest Camino 1.0b1 is based on Mozilla/Gecko 1.8 so that really should not have issues.

    I vaguely recall some recent pre-1.0 Caminos being based on ancient versions of Gecko and exhibiting this named entities bug.

    The AdSense injector should work fine, as it’s about the simplest plugin in the world, and it appears to be working fine on my dev site. :D

    Comment by Jonathan Stanley at 13:33:35 UTC on the 4th of January, 2006

  20. So if it is the browser… I got in touch with the Camino developers through mail, it seems they weren’t aware of it and told me they’d test it :) !

    …but what I’m wondering now about your blog… what is this ICBM meta tag doing with my geographic position as its coordinates :P ? (And what is the use of having the UA string and accepted MIME types of the UA present in meta tags :s ?) I feel spied on somehow ;) .

    Comment by Tjaard at 18:03:30 UTC on the 4th of January, 2006

  21. I know it’s not a site bug as otherwise Firefox/Seamonkey/Opera and so forth would b0rk! ;) As for the geo co-ordinates… they’re quite definitely mine! The user-agent info I have in there for debugging purposes when someone complains this blog is broken… ;)

    Comment by Jonathan Stanley at 23:57:01 UTC on the 4th of January, 2006

  22. Hmm, popping your code fragment at the start of send_headers() doesn’t work… I may try other things in the coming weeks, but it isn’t as easy as it seemed at first glance :( …

    Ah well, as long as my blog doesn’t get hax0red I’ll be fine I guess :) .

    Comment by Tjaard at 18:23:17 UTC on the 29th of January, 2006

  23. Won’t be able to help as I’m not downgrading to WordPress 2.0. Page generation times are already appalling with WordPress 1.x, in the 1 to 2 second range even with PHP opcode caching (via eAccelerator), yet with WordPress 2.0 they go up fivefold to 10+ seconds!

    Quite an annoyance when one is used to phpBB, particularly phpBB3, which can spit out some very complex pages in a matter of milliseconds. Just a pity none of us at phpBB have the time to write a blog based on the phpBB “Olympus” core…

    Comment by Jonathan Stanley at 18:43:09 UTC on the 29th of January, 2006
