Not really a "development" milestone so much as a "data management" milestone, but all 5+GB of our JSON files with the full text content of Nature just passed JSON validation using our shiny new validation tool, after an evening's worth of patching tiny data consistency bugs. Time to spend tomorrow getting the centralized MongoDB server up and running!
Didn't do much coding these last few weeks, as I got buried in the avalanche of semester-end grading. Just got back to it with the idea of polishing up the Mongo-Solr tool I talked about in the last post. Used Kong to rework the CLI, and now I've been thinking about what other roles this thing can play.
There are two big "data-transit" points in our workflow. The first is the move from raw JSON on my NAS into the MongoDB server. That's how we onboard new data we've scraped, received, or transformed from external content providers like journal publishers. The second is the move from the MongoDB server to the Solr server, which powers all of our search and textual analysis system (and will still power much of it, even under our future plans for tool development).
At both of those points, we can (and probably should!) apply various kinds of transforms. If nothing else, we move back and forth from some proper date-types in Mongo (Unix timestamps) to RFC3339/ISO8601 string-format dates. We also rip out at the very least some Mongo-specific properties that Solr shouldn't know about (like the internal Mongo _id) -- there may well be more properties like this when we start doing citation network analysis and similar data-integration work inside of the Mongo server!
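To make that concrete, here's a minimal sketch of the Mongo-to-Solr direction. It is not the actual mongo-solr code: the function names (toSolrDoc, convertValue, exampleDoc) are made up for illustration, and it assumes the document has already been decoded into a plain Go map with dates as time.Time values, which depends on how the Mongo driver is configured.

```go
package main

import (
	"fmt"
	"time"
)

// toSolrDoc returns a copy of a Mongo-style document prepared for Solr.
// It drops the Mongo-internal "_id" key (at every nesting level, which is
// an assumption of this sketch) and converts dates to RFC 3339 strings.
func toSolrDoc(doc map[string]any) map[string]any {
	out := make(map[string]any, len(doc))
	for k, v := range doc {
		if k == "_id" {
			continue
		}
		out[k] = convertValue(v)
	}
	return out
}

// convertValue rewrites a single value, recursing into nested maps and slices.
func convertValue(v any) any {
	switch t := v.(type) {
	case time.Time:
		// Normalize to UTC so the string ends in "Z".
		return t.UTC().Format(time.RFC3339)
	case map[string]any:
		return toSolrDoc(t)
	case []any:
		items := make([]any, len(t))
		for i, item := range t {
			items[i] = convertValue(item)
		}
		return items
	default:
		return v
	}
}

// exampleDoc builds a small, entirely hypothetical article record.
func exampleDoc() map[string]any {
	return map[string]any{
		"_id":   "hypothetical-mongo-id",
		"title": "An example article",
		"date":  time.Date(2021, 5, 1, 12, 0, 0, 0, time.UTC),
	}
}

func main() {
	fmt.Println(toSolrDoc(exampleDoc()))
}
```

The input map is never modified, so the same document can still be validated or logged after the transform runs.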
So the next goal for me is to think more generally about this "transform" layer. What kinds of automated cleanups, changes, tweaks, etc. might it make sense to make while our data is moving around? Standardizing those inside this tool will give us the chance to make them clear, repeatable, and auditable.
Related to the post from yesterday, an interesting quirk of MongoDB if you want to use it as your centralized datastore: there's a maximum document size of 16MB. That's no problem for Sciveyor as it sits, because it's designed to be a database for journal articles. But if you wanted to adapt our system to something that was holding, say, OCR'ed book manuscripts, or you needed to carry the image/PDF data in your MongoDB, you'd have to figure out how to set up the GridFS layer that allows you to do that transparently.
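If you wanted an early warning before hitting that limit, a rough pre-flight check on the raw JSON might look like the sketch below. The function name and the 90% threshold are made up for illustration. The big caveat: the limit applies to the encoded BSON document, and BSON can come out larger or smaller than the equivalent JSON, so this is only a proxy.

```go
package main

import "fmt"

// maxBSONDocumentSize is MongoDB's per-document limit of 16 MiB, in bytes.
const maxBSONDocumentSize = 16 * 1024 * 1024

// nearMongoLimit reports whether a serialized document is at or above the
// given fraction of the limit. It measures the bytes it is handed, so for
// JSON input it only approximates the real BSON size.
func nearMongoLimit(raw []byte, fraction float64) bool {
	return float64(len(raw)) >= fraction*maxBSONDocumentSize
}

func main() {
	small := []byte(`{"title": "An example article"}`)
	fmt.Println(nearMongoLimit(small, 0.9))
}
```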
I'm going to try to pick up an idea that I had on here before about using my Mastodon account to replace my (very!) old Advogato account. (Anybody else remember Advogato? Good times.) After thinking about that more, I decided maybe a half-hidden corner of my personal site would be a better spot. So here comes the first development post (of hopefully many?) about digital humanities.
So first, for those of you who don't know me from a programming perspective, I'm the head developer on the project formerly known as RLetters and evoText, now known as Sciveyor. (More on this again later, but I got tired of the software and the main instance of the software having different names.) It's a system for performing textual analysis on journal articles. We've been built around Ruby/Rails + Solr for about a decade now. That's all slowly changing, but that's a story for future posts.
Now, we've had real data integration problems in the project over the years. The "central data store" was XML files on my NAS. That's cool and all for long-term archiving (and it works for feeding stuff into Solr), but it sucks for long-term project maintenance.
The fix: baptize our MongoDB instance as the "canonical" data store. I've got a JSON schema for our data (https://data.sciveyor.com/schema/ + https://codeberg.org/sciveyor/json-schema) and a tool to verify Mongo against it (https://codeberg.org/sciveyor/schema-tool).
So now we have a central source of data that we can check for validity (very quickly in Go). Today's goal was a super-quick-and-dirty way to mirror the contents of the MongoDB server over to Solr (https://codeberg.org/sciveyor/mongo-solr, more Go). It only checks version numbers and presence/absence, but it'll be good enough to keep the Solr server current.
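For the curious, the "version numbers and presence/absence" check boils down to something like the sketch below. This is an illustration rather than the real mongo-solr code: the syncPlan type, the planSync function, and the idea of holding everything as in-memory maps of ID to version are assumptions, as is treating any version mismatch as "Mongo wins".

```go
package main

import (
	"fmt"
	"sort"
)

// syncPlan lists the document IDs that need work to make Solr match Mongo.
type syncPlan struct {
	Add    []string // present in Mongo, missing from Solr
	Update []string // present in both, but the version numbers differ
	Delete []string // present in Solr, no longer in Mongo
}

// planSync compares two ID-to-version maps, treating Mongo as canonical.
func planSync(mongo, solr map[string]int) syncPlan {
	var plan syncPlan
	for id, mongoVersion := range mongo {
		solrVersion, ok := solr[id]
		switch {
		case !ok:
			plan.Add = append(plan.Add, id)
		case solrVersion != mongoVersion:
			plan.Update = append(plan.Update, id)
		}
	}
	for id := range solr {
		if _, ok := mongo[id]; !ok {
			plan.Delete = append(plan.Delete, id)
		}
	}
	// Map iteration order is random, so sort for repeatable output.
	sort.Strings(plan.Add)
	sort.Strings(plan.Update)
	sort.Strings(plan.Delete)
	return plan
}

func main() {
	// Hypothetical IDs and versions, purely for demonstration.
	mongo := map[string]int{"doc-a": 2, "doc-b": 1, "doc-c": 1}
	solr := map[string]int{"doc-a": 1, "doc-b": 1, "doc-d": 1}
	fmt.Printf("%+v\n", planSync(mongo, solr))
}
```

Note what this doesn't catch: a document whose content changed without a version bump looks identical on both sides, which is exactly the "quick-and-dirty" part.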
And hey: in the intervening years since I last redid our schema, Solr now has decent nested-document support! Woohoo!
In any event, I think it'll be helpful for me to keep thinking through this stuff somewhere, and here is as good as anywhere else. Talking to myself about what I've done and what I'm going to do usually makes me a better developer.