Apparently one of my husband’s co-workers thinks I’m the smarter one.
Joke’s on him: I spent today failing to properly install mod_perl on a basic Apache install, then giving up and failing to even correctly activate a cgi-bin dir.
(Look, it’s been a while, okay?)
The blog source is in the repo now, but otherwise not much progress - largely due to the need to drive across the state and back today. I’ve fine-tuned some bits of the XML feed processing, but that’s it.
The next step is to get the poll_subscriptions stub filled out again.
It’s kind of exciting to see the triples dumped to screen. Once the polling sub puts them into the permanent store, it’s time to start playing around with some SPARQL queries and see if it’s as magical as its backers claim.
Yesterday’s work on converting XML to triples is mostly finished; most attributes are still being dropped, but that’s fairly trivial to fix… once I decide what it should be doing with them.
Wirebird::Remote::XML is still dumping a lot of diagnostics to console, but
I’ve restored the testing scripts. It’s not passing, of course, but it’s happily
subscribing to a feed. Feed polling is still part of W::R::XMLFeed, which is
Wirebird::Remote::XML builds a model including the
entries so that’ll be a quick conversion.
It’s too late for today’s commit push, but I think I’m going to move the blog source into the repo; it is, after all, the only changelog at this point.
Still kind of thinking about the library to use for processing Atom, RSS, and other non-RDFa XML.
Ended up expanding Wirebird::Remote::XML to play around with processing it manually, and at a first pass it works moderately well. Here is what I get when I run the Atom feed for my Mastodon profile through it.
This is what Wirebird::Remote::RDFa (via RDF::RDFa::Parser) gets. I haven’t quite figured out why Mastodon gets recognized as RDFa when Plerd’s feed doesn’t, but I haven’t fed it to a validator to see yet.
<https://mastodon.social/users/gamehawk.atom> <http://www.iana.org/assignments/relation/alternate> <https://mastodon.social/@gamehawk> ; <http://www.iana.org/assignments/relation/avatar> <https://files.mastodon.social/accounts/avatars/000/007/209/original/1a8cc570acbd05ea.png> ; <http://www.iana.org/assignments/relation/header> <https://files.mastodon.social/accounts/headers/000/007/209/original/media.jpg> ; <http://www.iana.org/assignments/relation/hub> <https://mastodon.social/api/push> ; <http://www.iana.org/assignments/relation/next> <https://mastodon.social/users/gamehawk.atom?max_id=8115759> ; <http://www.iana.org/assignments/relation/salmon> <https://mastodon.social/api/salmon/7209> ; <http://www.iana.org/assignments/relation/self> <https://mastodon.social/users/gamehawk.atom> .
This is what the current Wirebird::Remote::XMLFeed gets, parsing with XML::FeedPP and then dumping things into SIOC fields.
<https://mastodon.social/@gamehawk> <http://purl.org/dc/terms/created> "2018-06-14T23:26:25Z" ; <http://rdfs.org/sioc/ns#description> "Wandering ex-Jayhawker (not the same as a Jayhawk, but close), currently in Jersey (Philly area). Freelance Perl coder. She/her." ; <http://rdfs.org/sioc/ns#feed> <https://mastodon.social/@gamehawk> ; <http://rdfs.org/sioc/ns#link> <https://mastodon.social/@gamehawk> ; <http://rdfs.org/sioc/ns#name> "Karen C\U0001F968\U0001F475\U0001F3FB\U0001F332\U0001F3D6\uFE0F" ; a <http://rdfs.org/sioc/ns#WebLog> .
This is what Wirebird::Remote::XML can get, parsing with LibXML and doing a little hard-coded processing. It’s not using the feed’s namespaces yet, just the default ones built into RDF::Trine. Prefixes are not standardized, so this is a shortcut that should only be used with tame data, if then, but it’ll do for now. These should really be http://www.w3.org/2005/Atom rather than the p[ermanent]url RSS RDF::Trine guessed it as.
<https://mastodon.social/users/gamehawk.atom> <http://purl.org/rss/1.0/id> <https://mastodon.social/users/gamehawk.atom> ; <http://purl.org/rss/1.0/logo> <https://files.mastodon.social/accounts/avatars/000/007/209/original/1a8cc570acbd05ea.png> ; <http://purl.org/rss/1.0/subtitle> "Wandering ex-Jayhawker (not the same as a Jayhawk, but close), currently in Jersey (Philly area). Freelance Perl coder. She/her." ; <http://purl.org/rss/1.0/title> "Karen C\U0001F968\U0001F475\U0001F3FB\U0001F332\U0001F3D6\uFE0F" ; <http://purl.org/rss/1.0/updated> "2018-06-14T23:26:25Z" ; <http://purl.org/vocab/relationship/alternate> <https://mastodon.social/@gamehawk> ; <http://purl.org/vocab/relationship/hub> <https://mastodon.social/api/push> ; <http://purl.org/vocab/relationship/next> <https://mastodon.social/users/gamehawk.atom?max_id=8121439> ; <http://purl.org/vocab/relationship/salmon> <https://mastodon.social/api/salmon/7209> ; <http://purl.org/vocab/relationship/self> <https://mastodon.social/users/gamehawk.atom> ; a <http://schema.org/WebPage> .
Processing the (nicely fleshed out) author in the feed otherwise goes on to give us:
<https://mastodon.social/users/gamehawk> <http://purl.org/rss/1.0/email> "firstname.lastname@example.org" ; <http://purl.org/rss/1.0/id> <https://mastodon.social/users/gamehawk> ; <http://purl.org/rss/1.0/name> "gamehawk" ; <http://purl.org/rss/1.0/summary> "<p>Wandering ex-Jayhawker (not the same as a Jayhawk, but close), currently in Jersey (Philly area). Freelance Perl coder. She/her.</p>" ; <http://purl.org/rss/1.0/uri> <https://mastodon.social/users/gamehawk> ; <http://purl.org/vocab/relationship/alternate> <https://mastodon.social/@gamehawk> ; <http://purl.org/vocab/relationship/avatar> <https://files.mastodon.social/accounts/avatars/000/007/209/original/1a8cc570acbd05ea.png> ; <http://purl.org/vocab/relationship/header> <https://files.mastodon.social/accounts/headers/000/007/209/original/media.jpg> ; "" <http://activitystrea.ms/schema/1.0/person>, "Karen C\U0001F968\U0001F475\U0001F3FB\U0001F332\U0001F3D6\uFE0F", "Wandering ex-Jayhawker (not the same as a Jayhawk, but close), currently in Jersey (Philly area). Freelance Perl coder. She/her.", "gamehawk", "public" .
(Something about the <activity:object-type>http://activitystrea.ms/schema/1.0/person</activity:object-type> line is confusing it there at the end, so I’ll have to track that down.)
So I have three resources going on here:
https://mastodon.social/users/gamehawk - Retrieving this with a browser redirects to…
https://mastodon.social/@gamehawk - … which has as a
https://mastodon.social/users/gamehawk.atom … which lists itself as its atom:id, the second link as its rel:alternate and the first as the author’s
Between the redirect and the alternate, Wirebird should probably figure out that these are all really the same resource, but that’s for down the road.
I’ve been digging around in the Solid repo to see how MIT/TBL have been solving this. The answer, again, is “generally handing it off to already-existing identities.” There’s a little bit, though: a signup app and, perhaps most relevant, a list of identity providers in JSON format. It’s not (yet?) any kind of machine-readable standard, and as of this writing its only entry, Databox.me, serves an nginx default page on its http port and a demo on its https, but it’s a start.
Server capability discovery mentions Turtle and JSON-LD being on the roadmap, so that’s promising.
It’s been a month since I started the project, so it seems like a good place to pause and summarize where everything is. The goal is to replace Facebook. The minimum viable product is, first, a feed reader - whose most important feature is that a user doesn’t know it’s a feed reader. The user just “follows” people and organizations, just like on Facebook, and things magically appear on her feed/wall, just like on Facebook.
It’s also standards-based, because there’s no point in replacing Facebook with another silo. That doesn’t entirely matter until the blogging side of things is in place, but the triplestore is there under the hood.
The repo still doesn’t really have a functioning app yet, but it’s out there in the world mostly to keep me accountable.
At one point, the critter could actually read Atom and RSS feeds and put them in the triplestore (and the ActivityPub inbox), but I’ve taken that apart again and that’s been my focus lately: a robust, extensible system to retrieve wild data and process it into proper Linked Data. Syndication feeds, being already XML, should require the least processing but of course it turns out that CPAN libraries are mostly geared toward getting those feeds farther from machine-readable instead.
So here’s where I’m at on things it does and needs to do for MVP:
Emphasized items are incomplete or nonexistent.
On a tangentially related note: I visited a Philadelphia Perl Mongers meeting last night, since the topic was Linked Data. The speakers described building the Global Change Information System which, of course, runs on something a little more powerful than a Raspberry Pi, and for an audience a little more tech-savvy than my hypothetical end users. It was still pretty interesting, if for no other reason than it was good to hear that somebody other than People In 2007 is actively developing these things, even if not in the particular ontologies I’m working with.
So it turns out that, on digging into the XML::Atom source, I happened to notice the pod included a note about Unicode that I had somehow previously missed. Adding
$XML::Atom::ForceUnicode = 1;
to Wirebird::Remote::XMLFeed eliminated the need to decode/flag the incoming data. Not sure how I overlooked that.
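For context, the decode/flag step that setting made unnecessary looks roughly like this in plain Perl. This is only a sketch using the core Encode module, not code from Wirebird, and the sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes as they might arrive from a feed.
# (Made-up sample: the word "Café", with the e-acute as two bytes.)
my $bytes = "Caf\xC3\xA9";

# Before decoding, Perl treats each byte as its own character.
my $byte_len = length $bytes;

# Decoding turns the byte string into a character string.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

printf "bytes: %d, characters: %d\n", $byte_len, $char_len;
```

The two lengths differ because the accented letter takes two bytes but is a single character once decoded, which is the kind of mismatch that makes undecoded feed data behave oddly.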
And then I noticed that XML::FeedPP had some substantial bugfixes, and decided to install and play with it. Alas, it’s built on top of XML::TreePP which seems considerably less robust than LibXML.
Both of them drop everything but the most basic fields, in the name of providing a standardized interface. This is usually fine since, honestly, almost no one fills in the more esoteric Atom fields. But if the data is there, I’d like to grab it, which means I’m increasingly leaning toward dropping the off-the-shelf library and just fleshing out Wirebird::Remote::XML.
This probably means it’s time to learn GRDDL, a tortured acronym for a method of converting XML to RDF/XML by means of a ruleset. Which really means this should be a “what I’m reading” entry rather than a “coding out loud” entry, since I’m scrapping code instead of committing it. Oops.
On a more meta note: it’s graduation week, so probably another light schedule for Wirebird work.
My development environment these days is a Raspberry Pi 3B+, which is rather depressingly close to the specs of my regular desktop machine. It lives out in the media room, hooked up to the TV, on the network via wifi, and I have it crosslinked to my desktop machine via sshfs. Just by way of keeping my environment consistent, I use Geany to edit it regardless of which keyboard I’m at.
The Perl run environment is on the Pi, by way of keeping me honest as far as performance goes - inefficient code makes itself felt very quickly. Disk access is the most noticeable.
All of which is a lengthy excuse for why I’m just now getting around to installing Perl::Critic. Perltidy has been around but not reliably used (“get it running in Geany” is off on the yak-shaving list).
Perlcritic caught a missing “use strict” in one library, but otherwise it passed -gentle with flying colors. -stern was a whole different matter, but it now almost passes -harsh which, honestly, is as far as I’m likely to go. The one exception is Wirebird::Handler::XMLFeed::pollSubscription(), which it accuses of high complexity and… it’s not wrong. That subroutine is one big TODO on
Perl makes some compromises in its Unicode implementation to avoid breaking old code and boy it sure makes new code work funny sometimes.
Not too much to say today; coding right now is just a matter of:
It’s tedious, and there are all sorts of entertaining pitfalls (like realizing that UXTerm on my desktop machine displays things differently than the Pi does, and that I need to put everything in tests and not just inspect it in console).
As the Wirebird::Remote hierarchy gradually fills in, there are some
things I need to consider. As I’ve alluded to, very often a resource
will have multiple sources of
and Wirebird will need to decide what’s best because there is no
“authoritative” source. The
bid() system currently just parses the
best available source (in terms of what the standard can provide, not
necessarily what a particular implementation does provide), but
ultimately it will need to parse all sources above a minimum quality.
Back to my mastodon.social profile as an example. Currently, Wirebird can parse it as follows, in order of preference:
The RSS doesn’t offer anything the Atom doesn’t, but if the Atom is
missing, the RSS doesn’t have a lot of things the HTML does. But the
standard offers it, so I don’t want to necessarily bump RSS below HTML
in priority. So it’s probably going to be necessary for the
Remote::retrieve() routines to not stomp on already-existing
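The idea of parsing every source above a minimum quality, without letting a lower-priority source stomp on what a better one already supplied, might be sketched like this. Everything here is hypothetical: the bid numbers, the source list, the property names, and the merge_sources() helper are stand-ins for illustration, not Wirebird’s actual bid() or retrieve() code.

```perl
use strict;
use warnings;

# Hypothetical sources: each has a bid (how much the *standard* could
# provide) and the properties this particular source actually yielded.
my @sources = (
    { name => 'Atom',  bid => 90, data => { title => 'From Atom', updated => '2018-06-14' } },
    { name => 'RSS',   bid => 70, data => { title => 'From RSS' } },
    { name => 'HTML',  bid => 50, data => { title => 'From HTML', avatar => 'avatar.png' } },
    { name => 'Guess', bid => 10, data => { language => 'en' } },
);

# Merge every source at or above $min_bid, highest bid first.
# A value that is already set is never overwritten by a lower bidder.
sub merge_sources {
    my ($min_bid, @candidates) = @_;
    my %merged;
    my @usable = sort { $b->{bid} <=> $a->{bid} }
                 grep { $_->{bid} >= $min_bid } @candidates;
    for my $src (@usable) {
        for my $key (keys %{ $src->{data} }) {
            $merged{$key} //= $src->{data}{$key};
        }
    }
    return \%merged;
}

my $profile = merge_sources(40, @sources);
print "$_: $profile->{$_}\n" for sort keys %$profile;
```

In this sketch the title comes from the Atom source, the avatar is filled in from the HTML because nothing better offered one, and the source below the cutoff is ignored. Real triples can be multi-valued, so an actual implementation would probably need something richer than one value per key.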
When processing the subscriptions, things get even more interesting. If we process the feed entries, that RSS still doesn’t have authorship information - but if we follow the link to the page for each status we have a lot more options. That means a lot more http calls, though. Probably unavoidable - I just have to keep track of whether a link has been thoroughly mined out, so I don’t keep calling it every time Wirebird notices it.
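Keeping track of which links are already mined out could be as simple as the following sketch. The %mined hash, mine_link(), and the fetch_and_parse() stand-in are all hypothetical names, and the example.org URLs are placeholders; a real version would presumably record this in the permanent store rather than in memory.

```perl
use strict;
use warnings;

my %mined;        # URL => epoch time it was fully processed
my $fetches = 0;  # counts simulated HTTP calls

# Stand-in for the real retrieval and parsing; here it only counts the call.
sub fetch_and_parse {
    my ($url) = @_;
    $fetches++;
    return 1;
}

# Fetch a link only if it has not already been mined out.
# Returns 1 if a fetch happened, 0 if the link was skipped.
sub mine_link {
    my ($url) = @_;
    return 0 if exists $mined{$url};
    fetch_and_parse($url);
    $mined{$url} = time;
    return 1;
}

# The same status link is noticed three times, a second link once.
mine_link('https://example.org/status/1') for 1 .. 3;
mine_link('https://example.org/status/2');

print "HTTP calls made: $fetches\n";
```

Storing a timestamp rather than a plain flag leaves room to re-mine a link later if it turns out that pages change after they are first seen.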
“Minimum viable product” turns out not to be so “minimal” when you’re trying to build a solid foundation.
No commits for a couple days, sorry. I was wrong last week when I said I’d have more time this week.