Lots of work going on, finally! I've spent some time in the last few days doing all sorts of development work (instead of grading, *cough*).
First: the Sciveyor issues and projects pages are now filled with delicious goodness about our future plans. Not just bugs, but also the status of corpus maintenance and our medium- and long-term plans for rewriting the architecture of the software as a whole. This was literally spread across seven to-do lists in various places (code, internal notebooks, issues, TODO.md in the project itself, etc., etc.). No wonder I couldn't keep track of what we were up to.
Second: there are some proper changes to the code going on. I'm steadily working on a number of ideas around our dataset and analysis task handling. The goal is to make datasets independent objects, not something owned by users -- that is, each created dataset receives a UUID and can be referenced with a public URI. Users then save a list of datasets they're interested in, but don't "own" them.
That radically changes our ownership model. We had users owning datasets, which owned the tasks that had been run on them. That was all a mess. It didn't make sense for datasets to own their analysis tasks -- because some tasks are run on more than one dataset (e.g., comparison tasks)! It also now doesn't make sense for users to own datasets -- because they're public objects!
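To make the new ownership model concrete, here's a rough sketch in Python (not the actual Rails models; every class, field, and the base URI below is a made-up stand-in). Datasets carry their own UUID and public URI, tasks point at one or more datasets, and users only keep a list of dataset IDs they care about.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical base URI, for illustration only.
BASE_URI = "https://example.org/datasets/"


@dataclass
class Dataset:
    """A public, free-standing dataset: nobody owns it."""
    query: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    @property
    def uri(self) -> str:
        # The public URI is derived from the UUID alone.
        return f"{BASE_URI}{self.id}"


@dataclass
class Task:
    """An analysis task refers to datasets; it is not owned by any one of them."""
    kind: str
    dataset_ids: list  # one UUID for most tasks, several for comparison tasks


@dataclass
class User:
    """A user bookmarks datasets instead of owning them."""
    name: str
    saved: list = field(default_factory=list)

    def save(self, dataset: Dataset) -> None:
        # Saving is idempotent: the same dataset is only listed once.
        if dataset.id not in self.saved:
            self.saved.append(dataset.id)
```

With this shape, two different users can save the same dataset, and a comparison task simply lists two dataset IDs, which is exactly what the old "datasets own their tasks" model couldn't express.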
So I've been reworking that entire model and controller collection. It's still a little bit janky, but it all makes much more sense than it did before. That whole set of connected changes should be finished within the next week or two!
I'm happy to report that things are finally moving forward again on DH projects! As expected, replacing the entire primary data model (i.e., the Document object) at the heart of Sciveyor has been a huge pain, so it's basically been a long process of re-running unit tests and chipping away at failures until they disappear. Major sticking points have been the Author objects -- which used to be parsed out of a string but are now first-class Solr child documents -- and the fact that we've gone from representing publication dates as raw years to proper Solr Date objects (which in turn become Ruby DateTime objects). A bunch of hacky year manipulation code is gone, but that also means I need to think more about when "full date" support is useful and when collapsing to "years" by default is a good idea.
The other thing that this has done is expose a bunch of really nasty internal details to the outside world -- that is, our faceted-search queries, which used to be nice and simple and inoffensive in the browser's URL bar, are now painfully complicated and need to be hidden. My first target after I get the test suite fixed (that's 4.0.0-alpha1) is to hide those details from the frontend -- the frontend can just send the server queries like '...&facet="author:John Doe"' and the backend will handle translating that to a complex child-document query.
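For the curious, the translation I have in mind looks roughly like the sketch below. It's in Python rather than the Ruby the backend is actually written in, and the field names ("name" on author child documents, "type:article" to pick out parent documents) are assumptions for illustration, not our real schema. The only real Solr piece is the block-join {!parent which=...} query parser.

```python
# Hypothetical: a filter that matches every parent (article) document.
PARENT_FILTER = "type:article"

# Hypothetical: facets that live on child documents, mapped to the child field.
CHILD_FACETS = {"author": "name"}


def facet_to_solr(facet: str) -> str:
    """Turn a simple 'field:value' facet into a Solr filter query string."""
    name, _, value = facet.partition(":")
    if not name or not value:
        raise ValueError(f"expected 'field:value', got {facet!r}")

    # Escape backslashes and quotes so the value is safe inside a phrase.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')

    if name in CHILD_FACETS:
        # Block-join: return parents that have a matching child document.
        child_field = CHILD_FACETS[name]
        return f'{{!parent which="{PARENT_FILTER}"}}{child_field}:"{escaped}"'

    # Ordinary facets stay a plain field query on the parent document.
    return f'{name}:"{escaped}"'


print(facet_to_solr("author:John Doe"))
```

The point is that the URL only ever sees the short "author:John Doe" form, and the ugly block-join syntax stays on the server.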
Development on Sciveyor has obviously been stalled for a couple of months. I've been buried in other projects (lots of things making their way through review), and I've also been dedicating what programming time I have to a system for automatically clustering conference abstracts by subject and arranging them into time-blocks. It's based on a paper published in 2019 and data used to set up some meetings in 2013 and 2014. You can check it out here, if you're interested: https://codeberg.org/cpence/ish-abstracts
Not really a "development" milestone so much as a "data management" milestone, but all 5+GB of our JSON files with the full text content of Nature just passed JSON validation using our shiny new validation tool, after an evening's worth of patching tiny data consistency bugs. Time to spend tomorrow getting the centralized MongoDB server up and running!
Didn't do much coding these last few weeks, as I got buried in the avalanche of semester-end grading. Just got back to it with the idea of polishing up the Mongo-Solr tool I talked about in the last post. Used Kong to rework the CLI, and now I've been thinking about what other roles this thing can play.
There are two big "data-transit" points in our workflow. First, passing from raw JSON on my NAS into the MongoDB server in the first place. That's how we onboard new data we've scraped, received, or transformed from external content providers like journal publishers. Second is passing from the MongoDB server to the Solr server that powers all of our search and textual analysis system (and, even in our future plans for tool development, it will still power much of it).
At both of those points, we can (and probably should!) apply various kinds of transforms. If nothing else, we convert between proper date types in Mongo (Unix timestamps) and RFC3339/ISO8601 string-format dates. We also rip out, at the very least, some Mongo-specific properties that Solr shouldn't know about (like the internal Mongo _id) -- there may well be more properties like this when we start doing citation network analysis and similar data-integration work inside of the Mongo server!
So the next goal for me is to think more generally about this "transform" layer. What kinds of automated cleanups, changes, tweaks, etc. might it make sense to make while our data is moving around? Standardizing those inside this tool will give us the chance to make them clear, repeatable, and audit-able.
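As a strawman for what one such transform could look like, here's a minimal Python sketch of the Mongo-to-Solr direction (the real tool is in Go, and this is not its code). It assumes documents arrive as plain dicts with datetime values, that naive datetimes are meant as UTC, and that _id is the only Mongo-only key; all three are assumptions for illustration.

```python
from datetime import datetime, timezone

# Assumption: the only Mongo-specific property we strip, for now.
MONGO_ONLY_KEYS = {"_id"}


def to_rfc3339(value: datetime) -> str:
    """Format a datetime as an RFC3339 UTC string, e.g. 2019-05-01T00:00:00Z."""
    if value.tzinfo is None:
        # Assumption: naive datetimes are already in UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def convert(value):
    """Recursively convert one value into its Solr-friendly form."""
    if isinstance(value, datetime):
        return to_rfc3339(value)
    if isinstance(value, dict):
        return mongo_to_solr(value)
    if isinstance(value, list):
        return [convert(item) for item in value]
    return value


def mongo_to_solr(doc: dict) -> dict:
    """Drop Mongo-only keys (at every nesting level) and stringify dates."""
    return {
        key: convert(value)
        for key, value in doc.items()
        if key not in MONGO_ONLY_KEYS
    }
```

Keeping each transform as a small pure function like this is what would make them easy to list, test, and audit one at a time.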
Related to the post from yesterday, an interesting quirk of MongoDB if you want to use it as your centralized datastore: there's a maximum document size of 16MB. That's no problem for Sciveyor as it sits, because it's designed to be a database for journal articles. But if you wanted to adapt our system to something that was holding, say, OCR'ed book manuscripts, or you needed to carry the image/PDF data in your MongoDB, you'd have to figure out how to set up the GridFS layer that allows you to do that transparently.
I'm going to try to pick up an idea that I had on here before about using my Mastodon account to replace my (very!) old Advogato account. (Anybody else remember Advogato? Good times.) After thinking about that more, I decided maybe a half-hidden corner of my personal site would be a better spot. So here comes the first development post (of hopefully many?) about digital humanities.
So first, for those of you who don't know me from a programming perspective, I'm the head developer on the project formerly known as RLetters and evoText, now known as Sciveyor. (More on this again later, but I got tired of the software and the main instance of the software having different names.) It's a system for performing textual analysis on journal articles. We've been built around Ruby/Rails + Solr for about a decade now. That's all slowly changing, but that's a story for future posts.
Now, we've had real data integration problems in the project over the years. The "central data store" was XML files on my NAS. That's cool and all for long-term archiving (and it works for feeding stuff into Solr), but it sucks for long-term project maintenance.
The fix: baptize our MongoDB instance as the "canonical" data store. I've got a JSON schema for our data (https://data.sciveyor.com/schema/ + https://codeberg.org/sciveyor/json-schema) and a tool to verify Mongo against it (https://codeberg.org/sciveyor/schema-tool).
So now we have a central source of data that we can check for validity (very quickly in Go). Today's goal was a super-quick-and-dirty way to mirror the contents of the MongoDB server over to Solr (https://codeberg.org/sciveyor/mongo-solr, more Go). It only checks version numbers and presence/absence, but it'll be good enough to keep the Solr server current.
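The "version numbers and presence/absence" check boils down to something like this sketch. It's Python rather than Go and is not the mongo-solr code itself; it assumes each side can be summarized as a mapping from document ID to version number, which is my simplification for illustration.

```python
def plan_sync(mongo_versions: dict, solr_versions: dict):
    """Work out what Solr needs in order to mirror Mongo.

    Both arguments map document ID -> version number (a hypothetical
    summary of each store). Returns (ids_to_upsert, ids_to_delete).
    """
    # Upsert anything Solr is missing or holds at a different version.
    to_upsert = sorted(
        doc_id
        for doc_id, version in mongo_versions.items()
        if solr_versions.get(doc_id) != version
    )
    # Delete anything in Solr that no longer exists in Mongo.
    to_delete = sorted(
        doc_id for doc_id in solr_versions if doc_id not in mongo_versions
    )
    return to_upsert, to_delete
```

Note that this never compares document contents, so an edit in Mongo that doesn't bump the version number would go unnoticed; that's the "quick-and-dirty" part.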
And hey: in the intervening years since I last redid our schema, Solr now has decent nested-document support! Woohoo!
In any event, I think it'll be helpful for me to keep thinking through this stuff somewhere, and here is as good as anywhere else. Talking to myself about what I've done and what I'm going to do usually makes me a better developer.