Diaspora* Data Migration and Archival Lessons Learned


(So far)

This is a summary of my discoveries and learning over the past two months or so concerning Diaspora* data archives and references as well as JSON and tools for manipulating it, specifically jq.

It is a condensation of conversation mostly at my earlier Data Migration Tips & Questions (2022-1-10) thread, though also scattered elsewhere. I strongly recommend you review that thread and address general questions there.

Discussion here should focus on the specific information provided, any additions or corrections, and questions on how to access/use specific tools. E.g., how to get jq running on Microsoft Windows, which I don't have specific experience with.

Archival Philosophy


I'm neither a maximalist nor minimalist when it comes to content archival. What I believe is that people should be offered the tools and choices they need to achieve their desired goal. Where preservation is preferred and causes minimal harm, it's often desirable. Not everything needs to be preserved, but neither is it necessary to burn down every library one encounters as one journeys through life.

In particular, I'm seeking to preserve access for myself and others to previous conversations and discussions, and to content that's been shared and linked elsewhere. Several of my own posts have been submissions to Hacker News and other sites, for example, and archival at, say, the Internet Archive or Archive Today will preserve at least some access.

This viewpoint seems not to be shared by key members of the Diaspora* dev team and some pod administrators. As such, I'll note that their own actions and views reduce choice and agency amongst members of the Diaspora* community. The attitude is particularly incongruous given Diaspora*'s innate reliance on federation and content propagation according to the original specified intent of the content's authors and creators. This is hardly the first time Diaspora* devs have put their own concerns far above those of members of the Diaspora* community.

Information here is provided for those who seek to preserve content from their own profiles on Diaspora* servers likely to go offline, in the interest of maximising options and achieving desired goals. If this isn't your concern or goal, you may safely ignore what follows.

Prerequisites


The discussion here largely addresses working with a downloaded copy of Diaspora* profile data in JSON format.

It presumes you have jq installed on your system, and have a Bash or equivalent command-line / scripting environment. jq is available for most modern computers, though you will have to install it: natively on Linux, any of the BSDs, and macOS (via Homebrew), on Windows (via Cygwin or WSL), and on Android (via Termux). iOS is the only mass-market exception, and even there you might get lucky using iSH.

Create your archive by visiting your Pod's /user/edit page and requesting EXPORT DATA at the bottom of that page.

If you have issues doing so, please contact your Pod admin or other support contact(s). Known problems for some Joindiaspora members in creating archives are being worked on.
## Diaspora* post URLs can be reconstructed from the post GUID

The Diaspora* data extract does not include a canonical URL, but you can create one easily:

Post URL = <protocol><host_name>/posts/<GUID>

So for the GUID 64cc4c1076e5013a7342005056264835

We can tack on:
  • protocol: https://
  • host_name: pluspora.com (substitute your intended Pod's hostname here)
  • the string literal /posts/
to arrive at:

https://pluspora.com/posts/64cc4c1076e5013a7342005056264835

... which is the URL for a post by @Rhysy (rhysy@pluspora.com) in which I'd initially written the comment this post is based on, at that post's Pluspora Pod origin.

Given that Pluspora is slated to go offline a few weeks from now, Future Readers may wish to refer to an archived copy here:
https://archive.ph/Y8mar

Once you have the URL, you can start doing interesting things with it.
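A minimal sketch of that construction as a Bash function. The name post_url is my own invention for illustration, and it assumes the Pod is served over https:

```shell
# Build a post URL from a Pod hostname and a post GUID.
# post_url is a hypothetical helper name, not part of any Diaspora* tooling.
post_url() {
    local host="$1" guid="$2"
    printf 'https://%s/posts/%s\n' "$host" "$guid"
}
```

Calling post_url pluspora.com 64cc4c1076e5013a7342005056264835 prints the Pluspora URL given above.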

Links based on other Pod URLs can be created


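Using our previous example, links for the post on, e.g., diasp.org, diaspora.glasswings.com, diasp.eu, etc., can be generated by substituting for host_name:
  • https://diasp.org/posts/64cc4c1076e5013a7342005056264835
  • https://diaspora.glasswings.com/posts/64cc4c1076e5013a7342005056264835
  • https://diasp.eu/posts/64cc4c1076e5013a7342005056264835

Simply having a URL on a pod does not ensure that the content has been propagated there. A member of that pod must subscribe to the post first. In many cases this occurs through followers, though occasionally it does not.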

You can trigger federation by specifically mentioning a user at that instance and having them request the page.
I'm not sure when specifically federation occurs: when the notification is generated, when the notification is viewed, or when the post itself is viewed. I've often experienced such unfederated posts (404s) as I've updated, federated, and archived my own earlier content from Joindiaspora to Glasswings. If federation occurs at some time after initial publication and comments, the post URL and content should resolve, but comments made prior to that federation will not propagate.

(Pinging a profile you control on another pod is of course an excellent way to federate posts to that pod.)

Once a post is federated to a set of hosts it will be reachable at those hosts. If it has not yet been federated, you'll receive a "404" page, usually stating "These are not the kittens you're looking for. Move along." on Diaspora* instances.
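As a rough sketch, the per-Pod links and a simple reachability check can be scripted. The function names are hypothetical. It assumes curl is installed and that a public, federated post answers with HTTP 200 while an unfederated one answers 404. Private posts will likely not return 200 to an anonymous request even where they have federated, so treat the result as a hint only:

```shell
# Print candidate URLs for one post GUID on several Pods.
# Usage: pod_links GUID HOST [HOST ...]
pod_links() {
    local guid="$1"; shift
    local host
    for host in "$@"; do
        printf 'https://%s/posts/%s\n' "$host" "$guid"
    done
}

# Print the HTTP status each Pod returns for the post, one line per Pod.
# 200 suggests the post is reachable there; 404 suggests it has not federated.
check_federation() {
    local guid="$1"; shift
    local url
    pod_links "$guid" "$@" | while read -r url; do
        printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$url")" "$url"
        sleep 2   # pause between requests to keep load on the Pods low
    done
}
```

For example: check_federation 64cc4c1076e5013a7342005056264835 diasp.org diaspora.glasswings.com diasp.eu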

(I'm not aware of other ways to trigger federation; if anyone knows of methods, please advise in comments.)

Note that comments shown on a post will vary by Pod, when and how it was federated, and any blocks or networking issues between other Pods from which comments have been made. Not all instances necessarily show the same content; inconsistencies do occur.

Links to archival tools can be created by prepending their URLs to the appropriate link

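For the Wayback Machine and Archive.Today respectively, that gives:
  • https://web.archive.org/*/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835
  • https://archive.is/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835

Those will either show existing archives if they exist or provide links to submit the post if they do not.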

Note that the Internet Archive does not include comments, though Archive.Today does, see: https://archive.is/almMw vs. https://web.archive.org/web/20220224213824/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835

To include later comments, additional archival requests will have to be submitted.
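A small sketch of the prepending step, emitting one Markdown line in the same shape my archive index uses. The function name archive_links is hypothetical. The two prefixes are the lookup forms for the Wayback Machine and Archive.Today:

```shell
# Print Markdown lookup links for a post URL at both archives.
# archive_links is a hypothetical helper name.
archive_links() {
    local url="$1"
    printf '[Wayback Machine](https://web.archive.org/*/%s) :: [Archive.Today](https://archive.is/%s)\n' \
        "$url" "$url"
}
```

Usage: archive_links https://pluspora.com/posts/64cc4c1076e5013a7342005056264835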

My Archive-Index script does all of the above


See My current jq project: create a Diaspora post-abstracter.

https://diaspora.glasswings.com/posts/ed03bc1063a0013a2ccc448a5b29e257

That still has a few rough edges, but works to create an archive index which can be edited down to size. There's a fair bit of "scaffolding" in the direct output.

Note that the OLD and NEW hosts in the script specify Joindiaspora and Glasswings specifically. You'll want to adapt these to YOUR OWN old and new Pod hostnames.

The script produces output which (after editing out superfluous elements) looks like this in raw form:
## 2012

### May


**Hey everyone, I'm #NewHere. I'm interested in #debian and #linux, among other things. Thanks for the invite, Atanas Entchev!**

> Yet another G+ refuge. ...

<https://diaspora.glasswings.com/posts/cc046b1e71fb043d> 
[Original](https://joindiaspora.com/posts/cc046b1e71fb043d) :: [Wayback Machine](https://web.archive.org/*/https://joindiaspora.com/posts/cc046b1e71fb043d) :: [Archive.Today](https://archive.is/https://joindiaspora.com/posts/cc046b1e71fb043d) 

(2012-05-17 20:33)

----


**Does anyone have the #opscodechef wiki book as an ePub?  Only available formats are online/web, or PDF (which sucks).  I'm becoming a rapid fan of the #epub format having found a good reader for Android and others for Debian/Ubuntu.**

> Related:  strategies for syncing libraries across Android and desktop/laptop devices. ...

<https://diaspora.glasswings.com/posts/e76c078ba0544ad9> 
[Original](https://joindiaspora.com/posts/e76c078ba0544ad9) :: [Wayback Machine](https://web.archive.org/*/https://joindiaspora.com/posts/e76c078ba0544ad9) :: [Archive.Today](https://archive.is/https://joindiaspora.com/posts/e76c078ba0544ad9) 

(2012-05-17 21:29)
----

Which renders as:

2012

May


Hey everyone, I'm #NewHere. I'm interested in #debian and #linux, among other things. Thanks for the invite, Atanas Entchev!
Yet another G+ refuge. ...

https://diaspora.glasswings.com/posts/cc046b1e71fb043d
Original :: Wayback Machine :: Archive.Today

(2012-05-17 20:33)
Does anyone have the #opscodechef wiki book as an ePub? Only available formats are online/web, or PDF (which sucks). I'm becoming a rapid fan of the #epub format having found a good reader for Android and others for Debian/Ubuntu.
Related: strategies for syncing libraries across Android and desktop/laptop devices. ...

https://diaspora.glasswings.com/posts/e76c078ba0544ad9
Original :: Wayback Machine :: Archive.Today

(2012-05-17 21:29)

I've been posting those in fragments by year as private posts to myself, to facilitate both federation and archival of the content. They go up in chunks because Diaspora* has a 2^16 (65,536) byte per-post size limit. It's a slow slog, but I've only one more year (2021) to manually process at this point, with post counts numbering up to 535 per year.

The Internet Archive Wayback Machine (at Archive.org) accepts scripted archival requests


If you submit a URL in the form of https://web.archive.org/save/<URL>, the Wayback Machine will attempt to archive that URL.

This can be scripted for an unattended backup request if you can generate the set of URLs you want to save.

Using our previous example, the URL would be:

https://web.archive.org/save/https://pluspora.com/posts/64cc4c1076e5013a7342005056264835

Clicking that link will generate an archive request.

(The Internet Archive limits how frequently such requests will be processed.)

Joindiaspora podmins discourage this practice. Among the more reasonable concerns raised is system load.

I suggest that if you do automate archival requests, as I have done, you set a rate-limit or sleep timer on your script. A request every few seconds should be viable. Here it is as a Bash "one-liner" reading from the file DIASPORA_EXTRACT.json.gz (change to match your own archive file). It logs progress to a file named run-log with a YYYYMMDD-HHMMSS timestamp suffix, e.g., run-log.20220224-222158:
time zcat DIASPORA_EXTRACT.json.gz |
    jq -r '.user .posts[] | "https://joindiaspora.com/posts/\(.entity_data .guid )"' |
    xargs -P4 -n1 -t -r ~/bin/archive-url |
    tee run-log.$(date +%Y%m%d-%H%M%S)

archive-url is a Bash shell script:
#!/bin/bash

url=${1}

echo -e "Archiving ${url} ... "
lynx -dump -nolist -width=1024 "https://web.archive.org/save/${url}"  |
    sed -ne '/[Ss]aving page now/,/^$/{/./s/^[  ]*//p;}' |
    grep 'Saving page now'

sleep 4

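Note that each invocation waits 4 seconds after its request (sleep 4). Run serially, that caps the script at 900 requests per hour. With xargs -P4 as shown, up to four invocations run in parallel, so the overall ceiling is closer to 3,600 per hour; use -P1 if you want the lower rate. There is NO error detection, and you should confirm that posts you think you archived actually are archived. (We can discuss methods for this in comments; I'm still working on how to achieve this.)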

The script could be improved to only process public posts, something I need to look into. Submitting private posts won't result in their archival, but it's additional time and load.
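One possible way to do that filtering, as an untested-against-every-pod sketch: it assumes each post's entity_data in the export carries a boolean field named public, which you should confirm against your own archive before relying on it. The function name public_post_urls is hypothetical.

```shell
# List URLs for public posts only, from a gzipped Diaspora* export.
# Usage: public_post_urls ARCHIVE.json.gz HOSTNAME
# Assumption: .entity_data.public is a boolean flag on each post.
public_post_urls() {
    local archive="$1" host="$2"
    gzip -dc "$archive" |
        jq -r --arg host "$host" '
            .user.posts[]
            | select(.entity_data.public == true)
            | "https://\($host)/posts/\(.entity_data.guid)"'
}
```

Its output can replace the zcat and jq stages of the one-liner above, e.g. public_post_urls DIASPORA_EXTRACT.json.gz joindiaspora.com piped into the same xargs command.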

There is no automated submission mechanism for Archive.Today of which I'm aware.

Appending .json to the end of a Diaspora* URL provides the raw JSON data for that post:


https://joindiaspora.com/posts/64cc4c1076e5013a7342005056264835.json

That can be further manipulated with tools, e.g., to extract original post or comment Markdown text, or other information. jq is useful for this, as described in other posts under the #jq hashtag generally.
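A minimal sketch of fetching and reading that JSON. Both function names are hypothetical, and the assumption that the post body sits in a top-level field named text is my own reading of the data, so check it against a real response:

```shell
# Fetch the raw JSON for a post URL by appending ".json".
# fetch_post_json and post_text are hypothetical helper names.
fetch_post_json() {
    curl -s "${1}.json"
}

# Read post JSON on stdin and print the post's Markdown body.
# Assumption: the body is stored in a top-level "text" field.
post_text() {
    jq -r '.text'
}
```

Usage: fetch_post_json https://joindiaspora.com/posts/64cc4c1076e5013a7342005056264835 | post_text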

Notably:

As always: This is my best understanding


There are likely errors and omissions. Much of the behaviour and structure described is inferred. Corrections and additions are welcomed.

# # # # # # # # # #
in reply to Doc Edward Morbius

Finding your First Remote-Pod Followers
A critical question in salvaging content is determining what content was federated, and where. This determines which of your posts were federated and whether all comments can be found at a given pod.

To do this, you can look to see what remote profile(s) and pod(s) followed you earliest. If that pod remains active, there’s a good chance your content and comments from after that initial follow date are fully represented there. ...
https://diaspora.glasswings.com/posts/feaae8207c8d013a5b1e448a5b29e257
in reply to Doc Edward Morbius

@Gary Hill Regarding your question here, at a post I'm blocked from commenting on, you should find this thread useful.

This post and discussion are specific tools and discoveries based on working with Diaspora* data exports and JSON tools over the past two months. That contrasts with the Q&A discussion of the related Diaspora Migration Tips and Questions Thread from which I've also mentioned you.

Please check the original copy on Glasswings to ensure you are seeing full comments, though diasp.org should be fully federated.

Generally:
  • The Diaspora* dev team have committed to creating a profile-import utility. That does not presently exist. If and when it does, you should be able to upload your content to new Pod(s).
  • To work with the archive yourself, both to see what's in it and to produce useful outputs, you'll want to use jq, a JSON parser and processor.
Using jq you can both see what is inside the archive and create outputs from it, such as lists of contacts and summaries of your posts and reshares.

I'd appreciate if you'd mention these posts in the updated announcement thread if you find this useful.
in reply to Doc Edward Morbius

The Wayback Machine APIs can be used to check for existence and date(s) of archived snapshots in the Internet Archive's Wayback Machine.

Quoting from the Archive:
The API can be used as follows:

http://archive.org/wayback/available?url=example.com

which might return:
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

if the url is available. When available, the url is the link to the archived snapshot in the Wayback Machine. At this time, archived_snapshots just returns a single closest snapshot, but additional snapshots may be added in the future.

If the url is not available (not archived or currently not accessible), the response will be:
{"archived_snapshots":{}}

https://archive.org/help/wayback_api.php
The query URLs can be constructed from the Diaspora* archive, fed to a web query via curl or wget, and logged, to see which of your posts have been captured.
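A sketch of such a check, built only on the response shape quoted above. The function names are hypothetical, and it assumes curl and jq are installed:

```shell
# Read an availability-API response on stdin and print the closest
# snapshot URL; prints nothing when archived_snapshots is empty.
snapshot_url() {
    jq -r '.archived_snapshots.closest.url // empty'
}

# Query the Wayback availability API for one URL.
# curl -G sends the URL-encoded value as a query-string parameter.
check_archived() {
    local url="$1"
    curl -s -G 'https://archive.org/wayback/available' \
        --data-urlencode "url=${url}" | snapshot_url
}
```

Looping check_archived over your list of post URLs, with a sleep between calls, and logging which ones print nothing gives a list of posts still to be submitted.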

Using this, I've determined that 1,932 snapshots of my 2,659 total Joindiaspora posts have been captured, for a 73% overall success rate.
in reply to Doc Edward Morbius

Joindiaspora Total Size


A comment by the Joindiaspora-Sunset account reveals the total size of the Joindiaspora database (largely text content) and static-file (largely image) storage.
[F]or reference, the static files alone (i.e. the photo uploads) are more than 500 GiB in total, and the database alone is larger than 50 GiB.
https://diaspora.glasswings.com/posts/537493b07655013ae0f352540086c3e0#eea925b08066013ae10352540086c3e0

That is for
- 2,966,990 posts
- 1,412,039 comments
- 4,379,029 total content items
- 519 MAU (monthly active users)
- 1,147 six-month actives

That's roughly 10 kB content per item (posts + comments).

And about 170 kB of graphical / file content per post (comments can have linked images, but not uploaded ones).

It's about 100 MB total data (text) storage per MAU, and 43 MB per six-months actives.

And about 1 GB per MAU / 440 MB per six-months actives in photos.

On average.
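The averages above are simple divisions, and can be re-run with a small helper if the counts change. The name per_unit is hypothetical, and the byte totals you feed it depend on whether you read the quoted sizes as decimal GB or binary GiB:

```shell
# Divide a total size in bytes by a count, rounded to the nearest byte.
# Usage: per_unit TOTAL_BYTES COUNT
per_unit() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.0f\n", total / n }'
}
```

For instance, per_unit 50000000000 4379029 treats the database as 50 decimal GB and gives a little over 11,400 bytes per content item; reading it as 50 GiB (53,687,091,200 bytes) gives roughly 12.3 kB. Either way the figures here are order-of-magnitude estimates, since the quoted sizes are themselves lower bounds.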