Write some more docs

Andrew Godwin 2022-11-23 13:05:14 -07:00
parent c8ad22a704
commit 807d546b12
5 changed files with 130 additions and 61 deletions

docs/domains.rst (Normal file, 63 additions)

@@ -0,0 +1,63 @@
Domains
=======
One of Takahē's key design features is support for hosting ActivityPub users
under multiple different domains.
As a server administrator, you do this by adding one or more Domains to
your server; users can then create Identities (posting accounts) under them.
Domains can take two forms:
* **Takahē lives on and serves the domain**. In this case, you just set the domain
to point to Takahē and ensure you have a matching domain record; ignore the
"service domain" setting.
* **Takahē handles accounts under the domain but does not live on it**. For
example, you want to serve the ``@andrew@aeracode.org`` handle, but there
is already a site on ``aeracode.org``, so Takahē must live elsewhere
(e.g. on ``fedi.aeracode.org``).
In this second case, you need a *service domain* - the place where
Takahē and the Actor URIs for your users live, which is different from the
main domain you'd like the account handles to contain.
To set this up, you need to:
* Choose a service domain and point it at Takahē. *You cannot change this
domain later without breaking everything*, so choose very wisely.
* On your primary domain, forward the URLs ``/.well-known/webfinger``,
``/.well-known/nodeinfo`` and ``/.well-known/host-meta`` to Takahē.
* Set up a domain with these separate primary and service domains in its
record.
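To check that the setup works, you can ask the primary domain for a WebFinger
result and confirm the Actor URI it returns lives on the service domain. A
rough sketch using only the Python standard library, reusing the example
domains from above:

.. code-block:: python

    # Rough check that WebFinger on the primary domain hands back an Actor
    # URI on the service domain. Domain names are just the example ones above.
    import json
    import urllib.parse
    import urllib.request

    PRIMARY = "aeracode.org"        # domain used in handles
    SERVICE = "fedi.aeracode.org"   # domain Takahē is actually served on

    resource = urllib.parse.quote("acct:andrew@" + PRIMARY)
    url = f"https://{PRIMARY}/.well-known/webfinger?resource={resource}"

    with urllib.request.urlopen(url) as response:
        data = json.load(response)

    # The rel="self" link in the WebFinger response is the Actor URI.
    actor_uri = next(
        link["href"] for link in data["links"] if link.get("rel") == "self"
    )
    print(actor_uri)
    assert urllib.parse.urlparse(actor_uri).hostname == SERVICE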
Technical Details
-----------------
At its core, ActivityPub is a system built around URIs; the
``@username@domain.tld`` format is actually based on Webfinger, a different
standard, and is merely used to discover the Actor URI for someone.
Making a system that accepts any Webfinger handle is relatively
easy, but unfortunately Webfinger is only used when users are discovered via
mentions and search; when an incoming Follow arrives, or a Post is boosted
onto your timeline, you have to discover the user's Webfinger handle
*from their Actor URI*, and this is where it gets tricky.
Mastodon, and from what we can tell most other implementations, does this by
taking the ``preferredUsername`` field from the Actor object and the domain
from the Actor URI, and then performing a Webfinger lookup on that combination
of username and domain. This means that the domain you serve the Actor URI on
must map uniquely to a Webfinger handle domain - they don't need to match, but
they do need to be translatable into one another.
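For illustration, that reverse lookup boils down to something like the sketch
below; this is the general approach only, not Mastodon's or Takahē's actual
code:

.. code-block:: python

    # Sketch: given an Actor URI, work out the handle to WebFinger.
    import json
    import urllib.parse
    import urllib.request

    def handle_from_actor(actor_uri: str) -> str:
        request = urllib.request.Request(
            actor_uri,
            headers={"Accept": "application/activity+json"},
        )
        with urllib.request.urlopen(request) as response:
            actor = json.load(response)
        # Username comes from the Actor object, domain from the URI it was
        # served on - which is why that domain has to map to a handle domain.
        username = actor["preferredUsername"]
        domain = urllib.parse.urlparse(actor_uri).hostname
        return f"@{username}@{domain}"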
Takahē handles all this internally, however, with a concept of Domains. Each
domain has a primary (display) domain name, and an optional "service" domain;
the primary domain is what we will use for the user's Webfinger handle, and
the service domain is what their Actor URI is served on.
We look at ``HOST`` headers on incoming requests to match users to their
domains, though for Actor URIs we ensure the domain is in the URI anyway.
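Conceptually, the mapping works roughly like the sketch below; the class,
fields and helper here are illustrative, not Takahē's actual models:

.. code-block:: python

    # Illustrative sketch of the Domain concept, not the real model.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Domain:
        domain: str                            # shown in handles (WebFinger)
        service_domain: Optional[str] = None   # where Actor URIs are served

        @property
        def uri_domain(self) -> str:
            """Domain used for Actor URIs and incoming requests."""
            return self.service_domain or self.domain

    def domain_for_host(domains: list[Domain], host: str) -> Optional[Domain]:
        """Match an incoming request's HOST header to a Domain."""
        for candidate in domains:
            if host in (candidate.domain, candidate.service_domain):
                return candidate
        return None

    # Handles live on aeracode.org; Takahē itself serves the service domain.
    aeracode = Domain(domain="aeracode.org", service_domain="fedi.aeracode.org")
    assert domain_for_host([aeracode], "fedi.aeracode.org") is aeracode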

@@ -15,4 +15,5 @@ in alpha. For more information about Takahē, see
    :caption: Contents:

    installation
-   principles
+   domains
+   stator

@@ -14,6 +14,7 @@ Prerequisites
 * SSL support (Takahē *requires* HTTPS)
 * Something that can run Docker/OCI images
 * A PostgreSQL 14 (or above) database
+* Hosting/reverse proxy that passes the ``HOST`` header down to Takahē
 * One of these to store uploaded images and media:

   * Amazon S3
@@ -28,7 +29,7 @@ This means that a "serverless" platform like AWS Lambda or Google Cloud Run is
 not enough by itself; while you can use these to serve the web pages if you
 like, you will need to run the Stator runner somewhere else as well.

-The flagship Takahē instance, [takahe.social](https://takahe.social), runs
+The flagship Takahē instance, `takahe.social <https://takahe.social>`_, runs
 inside of Kubernetes, with one Deployment for the webserver and one for the
 Stator runner.

@@ -1,59 +0,0 @@
Design Principles
=================
Takahē is somewhat opinionated in its design goals, which are:
* Simplicity of maintenance and operation
* Multiple domain support
* Asynchronous Python core
* Low-JS user interface
These are explained more below, but it's important to stress the one thing we
are not aiming for - scalability.
If we wanted to build a system that could handle hundreds of thousands of
accounts on a single server, it would be built very differently - queues
everywhere as the primary communication mechanism, most likely - but we're
not aiming for that.
Our final design goal is for around 10,000 users to work well, provided you do
some PostgreSQL optimisation. It's likely the design will work beyond that,
but we're not going to put any specific effort towards it.
After all, if you want to scale in a federated system, you can always launch
more servers. We'd rather work towards the ability to share moderation and
administration workloads across servers rather than have one giant big one.
Simplicity Of Maintenance
-------------------------
It's important that, when running a social networking server, you have as much
time to focus on moderation and looking after your users as you can, rather
than trying to be an SRE.
To this end, we use our deliberate design aim of "small to medium size" to try
and keep the infrastructure simple - one set of web servers, one set of task
runners, and a PostgreSQL database.
The task system (which we call Stator) is not based on a task queue, but on
a state machine per type of object, with retry logic built in. The
system continually examines every object to see if it can progress its state
by performing an action, which is not quite as *efficient* as using a queue,
but recovers much more easily and doesn't get out of sync.
Multiple Domain Support
-----------------------
TODO
Asynchronous Python
-------------------
TODO
Low-JS User Interface
---------------------

docs/stator.rst (Normal file, 63 additions)

@@ -0,0 +1,63 @@
Stator
======
Takahē's background task system is called Stator, and rather than being a
traditional task queue, it is a *reconciliation loop* system; the
workers look for objects that might have an action to take, try to take it,
and update the object if successful.
As someone running Takahē, the most important aspects of this are:
* You have to run at least one Stator worker to make things like follows,
posting, and timelines work.
* You can run as many workers as you want; there is a locking system to ensure
they can coexist.
* You can get away without running any workers for a few minutes; the server
will continue to accept posts and follows from other servers, and will
process them when a worker comes back up.
* There is no separate queue to run, flush or replay; it is all stored in the
main database.
* If all your workers die, just restart them, and within a few minutes the
existing locks will time out and the system will recover itself and process
everything that's pending.
You run a worker via the command ``manage.py runstator``. It will run forever
until it is killed; send it SIGINT (Ctrl-C) once to have it enter graceful
shutdown, and a second time to force it to exit immediately.
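The shutdown behaviour follows the common pattern of "first signal drains,
second signal exits"; in outline (and only as a sketch of the idea, not
Takahē's actual worker code) that looks like:

.. code-block:: python

    # Sketch of "first SIGINT drains gracefully, second one exits now".
    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigint(signum, frame):
        global shutting_down
        if shutting_down:
            sys.exit(1)        # second Ctrl-C: stop immediately
        shutting_down = True   # first Ctrl-C: finish current work, then stop

    signal.signal(signal.SIGINT, handle_sigint)

    while not shutting_down:
        # ... run one pass of the worker loop here ...
        time.sleep(1)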
Technical Details
-----------------
Each object managed by Stator has a set of extra columns:
* ``state``, the name of a state in a state machine
* ``state_ready``, a boolean saying if it's ready to have a transition tried
* ``state_changed``, when it entered into its current state
* ``state_attempted``, when a transition was last attempted
* ``state_locked_until``, the time until which the entry is locked by a worker
They also have an associated state machine, a subclass of
``stator.graph.StateGraph``, which defines a series of states, the
possible transitions between them, and handlers that run for each state to see
if a transition is possible.
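As a toy illustration of that shape - named states, allowed transitions, and
an async handler per state that may return the next state - something like
the following; the names are made up and this is not the real
``stator.graph.StateGraph`` interface:

.. code-block:: python

    # Toy illustration only: states for an outgoing follow, the transitions
    # allowed between them, and a handler per state. Not the real API.
    from typing import Optional

    FOLLOW_STATES = {
        # state: (allowed next states, try_interval in seconds)
        "unrequested": ({"requested"}, 30),
        "requested": ({"accepted"}, 300),
        "accepted": (set(), None),  # terminal / externally progressed
    }

    async def deliver_follow(follow) -> bool:
        """Stand-in for POSTing the Follow activity to the remote inbox."""
        return True

    async def handle_unrequested(follow) -> Optional[str]:
        # Only move on if delivery worked; returning None means "try again
        # later, after this state's try_interval".
        return "requested" if await deliver_follow(follow) else None

    async def handle_requested(follow) -> Optional[str]:
        # Waiting for the remote server's Accept, which arrives externally
        # and moves the object to "accepted" - nothing to do here.
        return None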
An object first has to become ready for execution (the scheduling pass is
sketched below):
* If it has just entered a new state, or has just been created, it is marked ready.
* If ``state_attempted`` is far enough in the past (based on the ``try_interval``
of the current state), a small scheduling loop marks it as ready.
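The scheduling pass amounts to something like this hypothetical Django-style
query, using the Stator columns listed above (everything else here is
illustrative):

.. code-block:: python

    # Hypothetical sketch of the scheduling pass, assuming Django-style
    # models with the Stator columns listed above. Not the real code.
    from datetime import timedelta
    from django.utils import timezone

    def mark_ready(model, try_intervals):
        """Flag instances whose try_interval has elapsed since the last attempt.

        ``try_intervals`` maps state name -> seconds between retries.
        """
        now = timezone.now()
        for state, interval in try_intervals.items():
            if interval is None:
                continue  # externally progressed states are never retried here
            model.objects.filter(
                state=state,
                state_ready=False,
                state_attempted__lte=now - timedelta(seconds=interval),
            ).update(state_ready=True)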
Then, in the main fast loop of the worker (sketched in simplified form after
this list), it:
* Selects an item with ``state_ready`` set that is in a state it can handle (some
states are "externally progressed" and will not have handlers run)
* Fires up a coroutine for that handler and lets it run
* When that coroutine exits, sees if it returned a new state name and if so,
transitions the object to that state.
* If that coroutine errors or exits with ``None`` as a return value, it marks
down the attempt and leaves the object to be rescheduled after its ``try_interval``.
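Put together, one pass of that fast loop looks roughly like the sketch below;
methods such as ``transition`` and ``record_attempt`` are stand-ins for
whatever the real implementation does:

.. code-block:: python

    # Simplified, illustrative version of one pass of the worker loop.
    from typing import Awaitable, Callable, Optional

    Handler = Callable[[object], Awaitable[Optional[str]]]

    async def run_one(instance, handlers: dict[str, Handler]) -> None:
        handler = handlers.get(instance.state)
        if handler is None:
            return  # externally progressed state - nothing for us to do
        try:
            new_state = await handler(instance)
        except Exception:
            new_state = None  # an error counts as a failed attempt
        if new_state is not None:
            instance.transition(new_state)   # hypothetical: move state, mark ready
        else:
            instance.record_attempt()        # hypothetical: retry after try_interval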