Couldn’t leave well enough alone over the weekend, so I poked at the Unicode situation. Much ranting on social media later, I think I’ve figured out what’s happening, and for a change it wasn’t my own fault.

My Mastodon display name is Karen C(pretzel emoji)(old woman)(pine tree)(ocean). It’s geographic: there’s Philly, there’s me, there’s the Pinelands, there’s the Atlantic. (It used to be pine-woman-ocean, and I kind of miss the ocean a lot, but I also like being closer to the city.) It’s also problematic for Wirebird at the moment.

Sometimes the name came across as a string that was pretty clearly double- or even triple-escaped, but decoding it would sometimes produce emoji and other times produce an error because the code was over-decoding things. The inconsistency was a little puzzling, so I threw in some print statements and figured it out.

I’m using XML::Feed to process both RSS and Atom feeds. It’s a wrapper around XML::RSS and XML::Atom, which are by two different authors (neither of whom is the XML::Feed author), and XML::Feed papers over the inconsistencies… most of the time. But Mastodon’s RSS feeds are not coming in the same way its Atom feeds are.

  • Atom feed: utf8 encoded; utf8 flag set; not escaped
  • RSS feed: not utf8 encoded; utf8 flag not set; not escaped
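
For anyone wanting to reproduce that comparison, here is a rough sketch of the kind of print-statement helper that makes the difference visible. The name `describe_string` is made up for this example and is not part of XML::Feed or Wirebird; it only reports Perl's internal utf8 flag, the length, and the code point of each element of the string.

```perl
use strict;
use warnings;

# Hypothetical debugging helper (not from XML::Feed or Wirebird):
# summarize how Perl currently sees a string.
sub describe_string {
    my ($label, $text) = @_;

    # utf8::is_utf8 reports Perl's internal flag; it is always available.
    my $flag = utf8::is_utf8($text) ? 'set' : 'not set';

    # One U+XXXX entry per element: bytes if undecoded, characters if decoded.
    my $codes = join ' ', map { sprintf 'U+%04X', ord } split //, $text;

    return sprintf '%s: utf8 flag %s; length %d; %s',
        $label, $flag, length($text), $codes;
}

# The pretzel emoji as raw UTF-8 bytes, and as a single decoded character.
print describe_string('bytes', "\xF0\x9F\xA5\xA8"), "\n";
print describe_string('chars', "\x{1F968}"), "\n";
```

A length of 4 with the flag not set versus a length of 1 with the flag set is the tell that one string still needs decoding and the other would be damaged by decoding it again.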

The solution seems to be to give the two ::XMLFeed libraries a subroutine that checks for the flag and, if it’s not set, decodes the string (which also sets the flag).
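
A minimal sketch of what such a subroutine might look like, assuming the undecoded strings are always UTF-8 bytes. The name `ensure_decoded` is hypothetical, and this is not the actual Wirebird code, just the idea written out with the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return a decoded character string no matter which
# state the feed library handed us. Assumes undecoded input is UTF-8 bytes.
sub ensure_decoded {
    my ($text) = @_;

    return $text unless defined $text;

    # Flag already set: treat it as decoded and leave it alone, so we
    # never decode twice.
    return $text if utf8::is_utf8($text);

    # Flag not set: treat it as raw UTF-8 bytes and decode them.
    return decode('UTF-8', $text);
}

my $from_rss  = ensure_decoded("\xF0\x9F\xA5\xA8");  # raw bytes in
my $from_atom = ensure_decoded("\x{1F968}");         # already decoded
print $from_rss eq $from_atom ? "same\n" : "different\n";
```

One caveat: the flag is an internal detail rather than a reliable "has been decoded" marker. A decoded string containing only Latin-1 characters can have the flag off, and this check would wrongly decode it again, which is part of why it needs testing against more feeds.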

It’ll take looking at a lot more feeds to make sure this is working as I expect; Mastodon seems to be doing everything right as far as HTTP headers and internal encoding go, and not all feeds will be as well-behaved.
