Diaspora* Data Migration and Archival Lessons Learned
(So far)
This is a summary of my discoveries and learning over the past two months or so concerning Diaspora* data archives and references, as well as JSON and tools for manipulating it, specifically jq. It is a condensation of conversation, mostly at my earlier Data Migration Tips & Questions (2022-1-10) thread, though also scattered elsewhere. I strongly recommend you review that thread and address general questions there.
Discussion here should focus on the specific information provided, any additions or corrections, and questions on how to access/use specific tools. E.g., how to get #jq running on Microsoft Windows, which I don't have specific experience with.
Archival Philosophy
I'm neither a maximalist nor a minimalist when it comes to content archival. What I believe is that people should be offered the tools and choices they need to achieve their desired goal. Where preservation is preferred and causes minimal harm, it's often desirable. Not everything needs to be preserved, but neither is it necessary to burn down every library one encounters as one journeys through life.
In particular, I'm seeking to preserve access for myself and others to previous conversations and discussions, and to content that's been shared and linked elsewhere. Several of my own posts have been submissions to Hacker News and other sites, for example, and archival at, say, the Internet Archive or Archive Today will preserve at least some access.
This viewpoint seems not to be shared by key members of the Diaspora* dev team and some pod administrators. As such, I'll note that their own actions and views reduce choice and agency amongst members of the Diaspora* community. The attitude is particularly incongruous given Diaspora*'s innate reliance on federation and content propagation according to the original specified intent of the content's authors and creators. This is hardly the first time Diaspora* devs have put their own concerns far above those of members of the Diaspora* community.
Information here is provided for those who seek to preserve content from their own profiles on Diaspora* servers likely to go offline, in the interest of maximising options and achieving desired goals. If this isn't your concern or goal, you may safely ignore what follows.
Prerequisites
The discussion here largely addresses working with a downloaded copy of Diaspora* profile data in JSON format.
It presumes you have jq installed on your system, and have Bash or an equivalent command-line / scripting environment. Most modern systems can run jq, though you will have to install it: natively on Linux, any of the BSDs, MacOS (via Homebrew), Windows (via Cygwin or WSL), and Android (via Termux). iOS is the only mass-market exception, and even there you might get lucky using iSH.
Create your archive by visiting your Pod's /user/edit page and requesting EXPORT DATA at the bottom of that page.
If you have issues doing so, please contact your Pod admin or other support contact(s). Known problems for some Joindiaspora members in creating archives are being worked on.
Diaspora* post URLs can be reconstructed from the post GUID
The Diaspora* data extract does not include a canonical URL, but you can create one easily:
Post URL = <protocol><host_name>/posts/<guid>

So for the GUID 64cc4c1076e5013a7342005056264835 we can tack on:
- protocol: https://
- host_name: pluspora.com (substitute your intended Pod's hostname here)
- the string literal /posts/

Giving:
https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
... which is the URL for a post by @Rhysy (rhysy@pluspora.com) in which I'd initially written the comment this post is based on, at that post's Pluspora Pod of origin. Given that Pluspora is slated to go offline a few weeks from now, Future Readers may wish to refer to an archived copy here:
https://archive.ph/Y8mar
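The reconstruction above is simple string concatenation, which a sketch in shell makes concrete (hostname and GUID taken from the example):

```shell
# Reconstruct a Diaspora* post URL from its GUID
host="pluspora.com"                        # substitute your Pod's hostname
guid="64cc4c1076e5013a7342005056264835"
url="https://${host}/posts/${guid}"
echo "${url}"
# → https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
```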
Once you have the URL, you can start doing interesting things with it.
Links based on other Pod URLs can be created
Using our previous example, links for the post on, e.g., diasp.org, diaspora.glasswings.com, diasp.eu, etc., can be generated by substituting for host_name:
- https://diasp.org/posts/64cc4c1076e5013a7342005056264835
- https://diasp.eu/posts/64cc4c1076e5013a7342005056264835
- https://diaspora.glasswings.com/posts/64cc4c1076e5013a7342005056264835
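Generating such a list in bulk is a one-line loop; a sketch (the host list is illustrative, substitute the Pods you care about):

```shell
# Emit the same post's URL on several Pods
guid="64cc4c1076e5013a7342005056264835"
for host in diasp.org diasp.eu diaspora.glasswings.com; do
    echo "https://${host}/posts/${guid}"
done
```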
You can trigger federation by specifically mentioning a user at that instance and having them request the page.
I'm not sure when specifically federation occurs: when the notification is generated, when the notification is viewed, or when the post itself is viewed. I've often encountered such unfederated posts (404s) as I've updated, federated, and archived my own earlier content from Joindiaspora to Glasswings. If federation occurs at some time after initial publication and comments, the post URL and content should resolve, but comments made prior to that federation will not propagate.
(Pinging a profile you control on another pod is of course an excellent way to federate posts to that pod.)
Once a post is federated to a set of hosts it will be reachable at those hosts. If it has not yet been federated, you'll receive a "404" page, usually stating "These are not the kittens you're looking for. Move along." on Diaspora* instances.
(I'm not aware of other ways to trigger federation, if anyone knows of methods, please advise in comments.)
Note that the comments shown on a post will vary by Pod, by when and how it was federated, and by any blocks or networking issues between other Pods from which comments have been made. Not all instances necessarily show the same content; inconsistencies do occur.
Links to archival tools can be created by prepending their URLs to the appropriate link
- Archive.Today: https://archive.is/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
- Internet Archive: https://web.archive.org/*/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
Note that the Internet Archive does not include comments, though Archive.Today does; see: https://archive.is/almMw vs. https://web.archive.org/web/20220224213824/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
To include later comments, additional archival requests will have to be submitted.
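Sketched in shell, the prepending looks like this (post URL from the running example):

```shell
# Build archive-service links by prepending the service URL to the post URL
post_url="https://pluspora.com/posts/64cc4c1076e5013a7342005056264835"
echo "https://archive.is/${post_url}"          # Archive.Today
echo "https://web.archive.org/*/${post_url}"   # Internet Archive (all snapshots)
```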
My Archive-Index script does all of the above
See My current jq project: create a Diaspora post-abstracter.
https://diaspora.glasswings.com/posts/ed03bc1063a0013a2ccc448a5b29e257
That still has a few rough edges, but works to create an archive index which can be edited down to size. There's a fair bit of "scaffolding" in the direct output.
Note that the OLD and NEW hosts in the script specify Joindiaspora and Glasswings specifically. You'll want to adapt these to YOUR OWN old and new Pod hostnames.
The script produces output which (after editing out superfluous elements) looks like this in raw form:
## 2012
### May
**Hey everyone, I'm #NewHere. I'm interested in #debian and #linux, among other things. Thanks for the invite, Atanas Entchev!**
> Yet another G+ refuge. ...
<https://diaspora.glasswings.com/posts/cc046b1e71fb043d>
[Original](https://joindiaspora.com/posts/cc046b1e71fb043d) :: [Wayback Machine](https://web.archive.org/*/https://joindiaspora.com/posts/cc046b1e71fb043d) :: [Archive.Today](https://archive.is/https://joindiaspora.com/posts/cc046b1e71fb043d)
(2012-05-17 20:33)
----
**Does anyone have the #opscodechef wiki book as an ePub? Only available formats are online/web, or PDF (which sucks). I'm becoming a rapid fan of the #epub format having found a good reader for Android and others for Debian/Ubuntu.**
> Related: strategies for syncing libraries across Android and desktop/laptop devices. ...
<https://diaspora.glasswings.com/posts/e76c078ba0544ad9>
[Original](https://joindiaspora.com/posts/e76c078ba0544ad9) :: [Wayback Machine](https://web.archive.org/*/https://joindiaspora.com/posts/e76c078ba0544ad9) :: [Archive.Today](https://archive.is/https://joindiaspora.com/posts/e76c078ba0544ad9)
(2012-05-17 21:29)
----
Which renders as:
2012
May
Hey everyone, I'm #NewHere. I'm interested in #debian and #linux, among other things. Thanks for the invite, Atanas Entchev!
Yet another G+ refuge. ...
https://diaspora.glasswings.com/posts/cc046b1e71fb043d
Original :: Wayback Machine :: Archive.Today
(2012-05-17 20:33)
Does anyone have the #opscodechef wiki book as an ePub? Only available formats are online/web, or PDF (which sucks). I'm becoming a rapid fan of the #epub format having found a good reader for Android and others for Debian/Ubuntu.Related: strategies for syncing libraries across Android and desktop/laptop devices. ...
https://diaspora.glasswings.com/posts/e76c078ba0544ad9
Original :: Wayback Machine :: Archive.Today
(2012-05-17 21:29)
I've been posting those in fragments, by year, as private posts to myself to facilitate both federation and archival of the content. Chunking is necessary as Diaspora* has a 2^16^ (65,536) byte per-post size limit. It's a slow slog, but I've only one more year (2021) to manually process at this point, with post counts numbering up to 535 per year.
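The chunking itself can be automated. GNU split's -C option packs as many whole lines as fit under a byte limit into each output file; a sketch, where archive-index.md is a hypothetical name for the generated index:

```shell
# Split an index into chunks of at most 65,536 bytes each,
# breaking only at line boundaries (requires GNU split for -C/-d)
split -C 65536 -d archive-index.md chunk-
wc -c chunk-*    # verify each chunk is under the per-post limit
```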
The Internet Archive Wayback Machine (at Archive.org) accepts scripted archival requests
If you submit a URL in the form of https://web.archive.org/save/<URL>, the Wayback Machine will attempt to archive that URL. This can be scripted for an unattended backup request if you can generate the set of URLs you want to save.
Using our previous example, the URL would be:
https://web.archive.org/save/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
Clicking that link will generate an archive request.
(IA limit how frequently such a request will be processed.)
Joindiaspora podmins discourage this practice. Among the more reasonable concerns raised is system load.
I suggest that if you do automate archival requests, as I have done, you set a rate-limit or sleep timer on your script. A request every few seconds should be viable. As a Bash "one-liner" reading from the file DIASPORA_EXTRACT.json.gz (change to match your own archive file), which logs progress to a timestamped run-log file with a YYYYMMDD-hms format, e.g., run-log.20220224-222158:

time zcat DIASPORA_EXTRACT.json.gz |
    jq -r '.user .posts[] | "https://joindiaspora.com/posts/\(.entity_data .guid )"' |
    xargs -P4 -n1 -t -r ~/bin/archive-url |
    tee run-log.$(date +%Y%m%d-%H%M%S)
archive-url is a Bash shell script:

#!/bin/bash
url=${1}
echo -e "Archiving ${url} ... "
lynx -dump -nolist -width=1024 "https://web.archive.org/save/${url}" |
    sed -ne '/[Ss]aving page now/,/^$/{/./s/^[ ]*//p;}' |
    grep 'Saving page now'
sleep 4
Note that this waits 4 seconds between requests (sleep 4), which limits the script to a maximum of 900 requests per hour. There is NO error detection, and you should confirm that posts you think you archived actually are archived. (We can discuss methods for this in comments; I'm still working on how to achieve this.) The script could also be improved to only process public posts, something I need to look into. Submitting private posts won't result in their archival, but it adds time and load.
There is no automated submission mechanism for Archive.Today of which I'm aware.
Appending .json to the end of a Diaspora* URL provides the raw JSON data for that post:
https://joindiaspora.com/posts/64cc4c1076e5013a7342005056264835.json
That can be further manipulated with tools, e.g., to extract original post or comment Markdown text, or other information. Using jq is useful for this, as described in other posts under the #jq hashtag generally. Notably:
- Finding most frequent specific engagement peers
- Finding your most-engaged peers
- Extract the last (or other specified) comment(s) on a post
- Create a Diaspora archive-index
- "unjsonify-diaspora --- extract the original Markdown of a Diaspora* post. Note that this can be simplified to
jq -Mr '.text'
, excluding thesed
component (see comments on post).
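As a minimal illustration of that .text extraction (the JSON sample here is fabricated for the example; in practice you'd pipe curl output from a post's .json URL into jq):

```shell
# Extract the original Markdown from a post's JSON via the .text field
json='{"guid":"cc046b1e71fb043d","text":"**Hey everyone, I am #NewHere.**"}'
echo "$json" | jq -Mr '.text'
# → **Hey everyone, I am #NewHere.**
```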
As always: This is my best understanding
There are likely errors and omissions. Much of the behaviour and structure described is inferred. Corrections and additions are welcomed.
#DiasporaMigration #Migration #Diaspora #Help #Tips #JoindiasporaCom #jq #json #DataArchives #Archives
Doc Edward Morbius (in reply):
Diaspora* Migration & Data Archival: Finding your First Remote-Pod...

Doc Edward Morbius (in reply):
This post and discussion cover specific tools and discoveries based on working with Diaspora* data exports and JSON tools over the past two months. That contrasts with the Q&A discussion of the related Diaspora Migration Tips and Questions Thread, where I've also mentioned you.
Please check the original copy on Glasswings to ensure you are seeing full comments, though diasp.org should be fully federated.
Generally:
- The Diaspora* dev team have committed to creating a profile-import utility. That does not presently exist. If and when it does, you should be able to upload your content to new Pod(s).
- To work with the archive yourself, both to see what's in it and to produce useful outputs, you'll want to use jq, a JSON parser and processor.
Using jq you can both see what is inside the archive and create outputs from it, such as lists of contacts and summaries of your posts and reshares. I'd appreciate it if you'd mention these posts in the updated announcement thread if you find this useful.
An update on the future of JoinDiaspora.com
Doc Edward Morbius (in reply):
Quoting from the Archive:
The API can be used as follows:
http://archive.org/wayback/available?url=example.com
which, if the url is available, returns a JSON object with an archived_snapshots.closest entry (giving available, url, timestamp, and status fields). When available, the url is the link to the archived snapshot in the Wayback Machine. At this time, archived_snapshots just returns a single closest snapshot, but additional snapshots may be added in the future.
If the url is not available (not archived or currently not accessible), the response will contain an empty archived_snapshots object.
https://archive.org/help/wayback_api.php
The query URLs can be constructed from the Diaspora* archive, fed to a web query via curl or wget, and logged, to see which of your posts have been captured. Using this, I've determined that 1,932 snapshots of my 2,659 total Joindiaspora posts have been captured, for a 73% overall success rate.
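A sketch of one such check, applying jq to the availability response. The resp value here is a canned stand-in for the API's documented response shape; in practice you would pipe `curl -s "https://archive.org/wayback/available?url=${url}"` into jq:

```shell
# Report the closest Wayback snapshot URL for a page, or flag it as missing
resp='{"archived_snapshots":{"closest":{"available":true,"url":"http://web.archive.org/web/20130919044612/http://example.com/"}}}'
echo "$resp" | jq -r '.archived_snapshots.closest.url // "NOT ARCHIVED"'
# → http://web.archive.org/web/20130919044612/http://example.com/
```

The `//` alternative operator makes the missing-snapshot case (an empty archived_snapshots object) print "NOT ARCHIVED" instead of null, which keeps the log easy to grep.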
Doc Edward Morbius (in reply):
Joindiaspora Total Size
A comment by the Joindiaspora-Sunset account reveals the total size of the Joindiaspora data (largely text content) and file (largely image) storage:
https://diaspora.glasswings.com/posts/537493b07655013ae0f352540086c3e0#eea925b08066013ae10352540086c3e0
That is for:
- 2,966,990 posts
- 1,412,039 comments
- 4,379,029 total content items
- 519 MAU
- 1,147 six-months actives
That's roughly 10 kB content per item (posts + comments).
And about 170 kB of graphical / file content per post (comments can have linked images, but not uploaded ones).
It's about 100 MB total data (text) storage per MAU, and 43 MB per six-months actives.
And about 1 GB per MAU / 440 MB per six-months actives in photos.
On average.