# Dilation Internals

Wormhole dilation involves several moving parts. Both sides exchange messages
through the Mailbox server to coordinate the establishment of a more direct
connection. This connection might flow in either direction, so they trade
"connection hints" to point at potential listening ports. This process might
succeed in making multiple connections at about the same time, so one side
must select the best one to use, and cleanly shut down the others. To make
the dilated connection *durable*, this side must also decide when the
connection has been lost, and then coordinate the construction of a
replacement. Within this connection, a series of queued-and-acked subchannel
messages are used to open/use/close the application-visible subchannels.

## Versions and can-dilate

The Wormhole protocol includes a `versions` message sent immediately after
the shared PAKE key is established. This also serves as a key-confirmation
message, allowing each side to confirm that the other side knows the right
key. The body of the `versions` message is a JSON-formatted string with keys
that describe the abilities of the peer. Dilation is signaled by a key named
`can-dilate`, whose value is a list of strings. Any version present in both
sides' lists is eligible for use.
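
As a sketch (the `can-dilate` lists here are illustrative), the eligible
versions are just the intersection of the two lists:

```python
# Hypothetical can-dilate lists from the local and remote "versions" messages.
ours = ["2", "1"]    # ordered by local preference, most-preferred first
theirs = ["1"]

# Any version present in both sides' lists is eligible; keep our preference order.
eligible = [v for v in ours if v in theirs]
assert eligible == ["1"]
```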

## Leaders and Followers

Each side of a Wormhole has a randomly-generated dilation `side` string (this
is included in the `please-dilate` message, and is independent of the
Wormhole's mailbox "side"). When the wormhole is dilated, the side with the
lexicographically-higher "side" value is named the "Leader", and the other
side is named the "Follower". The general wormhole protocol treats both sides
identically, but the distinction matters for the dilation protocol.
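
A minimal sketch of the role decision (the side strings below are made-up
examples):

```python
import os

def make_side() -> str:
    # A randomly-generated dilation "side" string, hex-encoded.
    return os.urandom(8).hex()

def is_leader(my_side: str, their_side: str) -> bool:
    # The side with the lexicographically-higher value becomes the Leader.
    return my_side > their_side

assert is_leader("ffeeddcc", "00112233")
assert not is_leader("00112233", "ffeeddcc")
```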

Both sides send a `please-dilate` as soon as dilation is triggered. Each side
discovers whether it is the Leader or the Follower when the peer's
`please-dilate` arrives. The Leader has exclusive control over whether a
given connection is considered established or not: if there are multiple
potential connections to use, the Leader decides which one to use, and the
Leader gets to decide when the connection is no longer viable (and triggers
the establishment of a new one).

The `please-dilate` includes a `use-version` key, computed as the "best"
version of the intersection of the two sides' abilities as reported in the
`versions` message. Both sides will use whichever `use-version` was specified
by the Leader (they learn which side is the Leader at the same moment they
learn the peer's `use-version` value). If the Follower cannot handle the
`use-version` value, dilation fails (this shouldn't happen, as the Leader
knew what the Follower was and was not capable of before sending that
message).

## Connection Layers

We describe the protocol as a series of layers. Messages sent on one layer
may be encoded or transformed before being delivered on some other layer.

L1 is the mailbox channel (queued store-and-forward messages that always go
to the mailbox server, and then are forwarded to other clients subscribed to
the same mailbox). Both clients remain connected to the mailbox server until
the Wormhole is closed. They send DILATE-n messages to each other to manage
the dilation process, including records like `please`, `connection-hints`,
`reconnect`, and `reconnecting`.

L2 is the set of competing connection attempts for a given generation of
connection. Each time the Leader decides to establish a new connection, a new
generation number is used. Hopefully these are direct TCP connections between
the two peers, but they may also include connections through the transit
relay. Each connection must go through an encrypted handshake process before
it is considered viable. Viable connections are then submitted to a selection
process (on the Leader side), which chooses exactly one to use, and drops the
others. It may wait an extra few seconds in the hopes of getting a "better"
connection (faster, cheaper, etc), but eventually it will select one.

L3 is the current selected connection. There is one L3 for each generation.
At all times, the wormhole will have at most one L3 connection. L3 is
responsible for the selection process, connection monitoring/keepalives, and
serialization/deserialization of the plaintext frames. L3 delivers decoded
frames and connection-establishment events up to L4.

L4 is the persistent higher-level channel. It is created as soon as the first
L3 connection is selected, and lasts until the wormhole is closed entirely.
L4 contains OPEN/DATA/CLOSE/ACK messages: OPEN/DATA/CLOSE have a sequence
number (scoped to the L4 connection and the direction of travel), and the
ACK messages reference those sequence numbers. When a message is given to the
L4 channel for delivery to the remote side, it is always queued, then
transmitted if there is an L3 connection available. This message remains in
the queue until an ACK is received to retire it. If a new L3 connection is
made, all queued messages will be re-sent (in seqnum order).

L5 are subchannels. There is one pre-established subchannel 0 known as the
"control channel", which does not require an OPEN message. All other
subchannels are created by the receipt of an OPEN message with the subchannel
number. DATA frames are delivered to a specific subchannel. When the
subchannel is no longer needed, one side will invoke the ``close()`` API
(``loseConnection()`` in Twisted), which will cause a CLOSE message to be
sent, and the local L5 object will be put into the "closing" state. When the
other side receives the CLOSE, it will send its own CLOSE for the same
subchannel, and fully close its local object (``connectionLost()``). When the
first side receives CLOSE in the "closing" state, it will fully close its
local object too.

All L5 subchannels will be paused (``pauseProducing()``) when the L3
connection is paused or lost. They are resumed when the L3 connection is
resumed or reestablished.

## Initiating Dilation

Dilation is triggered by calling the `w.dilate()` API. This returns a
Deferred that fires, once the first L3 connection is established, with a
3-tuple of endpoints that can be used to establish subchannels, or errbacks
if dilation is not possible. If the other side's `versions` message
indicates that it does not support dilation, the Deferred will errback with
an `OldPeerCannotDilateError`.

For dilation to succeed, both sides must call `w.dilate()`, since the
resulting endpoints are the only way to access the subchannels. If the other
side is capable of dilation, but never calls `w.dilate()`, the Deferred will
never fire.

The L1 (mailbox) path is used to deliver dilation requests and connection
hints. The current mailbox protocol uses named "phases" to distinguish
messages (rather than behaving like a regular ordered channel of arbitrary
frames or bytes), and all-number phase names are reserved for application
data (sent via `w.send_message()`). Therefore the dilation control messages
use phases named `DILATE-0`, `DILATE-1`, etc. Each side maintains its own
counter, so one side might be up to e.g. `DILATE-5` while the other has only
gotten as far as `DILATE-2`. This effectively creates a pair of
unidirectional streams of `DILATE-n` messages, each containing one or more
dilation records, of various types described below. Note that all phases
beyond the initial VERSION and PAKE phases are encrypted by the shared
session key.
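
A sketch of how one side might number and serialize its outbound dilation
records (the helper function is hypothetical; only the `DILATE-n` phase
naming and JSON encoding come from the text above):

```python
import json
from itertools import count

_phase_counter = count(0)  # each side maintains its own DILATE-n counter

def next_dilation_phase(record: dict):
    """Return the (phase-name, body) pair for one outbound dilation record."""
    phase = "DILATE-%d" % next(_phase_counter)
    return phase, json.dumps(record).encode("utf-8")

phase, body = next_dilation_phase({"type": "please", "side": "abcdef",
                                   "accepted-versions": ["1"]})
assert phase == "DILATE-0"
assert json.loads(body)["type"] == "please"
```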

A future mailbox protocol might provide a simple ordered stream of typed
messages, with application records and dilation records mixed together.

Each `DILATE-n` message is a JSON-encoded dictionary with a `type` field that
has a string value. The dictionary will have other keys that depend upon the
type.

`w.dilate()` triggers transmission of a `please` (i.e. "please dilate")
record with a set of versions that can be accepted. Versions use strings,
rather than integers, to support experimental protocols; however, there is
still a total ordering of preferability.

```
{ "type": "please",
  "side": "abcdef",
  "accepted-versions": ["1"]
}
```

If one side receives a `please` before `w.dilate()` has been called locally,
the contents are stored in case `w.dilate()` is called in the future. Once
`w.dilate()` has been called locally and the peer's `please` has been
received, the side determines whether it is the Leader or the Follower. Both
sides also compare `accepted-versions` fields to choose the best
mutually-compatible version to use: they should always pick the same one.
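
Because both sides run the same deterministic computation on the same pair of
lists, they arrive at the same answer. A sketch, assuming (hypothetically)
that the total ordering of preferability is the order of a known-versions
list:

```python
# Hypothetical total ordering: later entries are more preferable.
KNOWN_VERSIONS = ["1", "2", "experimental-3"]

def choose_version(ours, theirs):
    # Pick the most-preferable version present in both accepted-versions lists.
    mutual = set(ours) & set(theirs)
    if not mutual:
        raise ValueError("no mutually-compatible dilation version")
    return max(mutual, key=KNOWN_VERSIONS.index)

# Both sides compute the same answer regardless of argument order.
assert choose_version(["1", "2"], ["2", "experimental-3"]) == "2"
assert choose_version(["2", "experimental-3"], ["1", "2"]) == "2"
```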

Then both sides begin the connection process for generation 1 by opening
listening sockets and sending `connection-hint` records for each one. After a
slight delay they will also open connections to the Transit Relay of their
choice and produce hints for it too. The receipt of inbound hints (on both
sides) will trigger outbound connection attempts.

Some number of these connections may succeed, and the Leader decides which to
use (via an in-band signal on the established connection). The others are
dropped.

If something goes wrong with the established connection and the Leader
decides a new one is necessary, the Leader will send a `reconnect` message.
This might happen while connections are still being established, or while the
Follower thinks it still has a viable connection (the Leader might observe
problems that the Follower does not), or after the Follower thinks the
connection has been lost. In all cases, the Leader is the only side which
should send `reconnect`. The state machine code looks the same on both sides,
for simplicity, but one path on each side is never used.

Upon receiving a `reconnect`, the Follower should stop any pending connection
attempts and terminate any existing connections (even if they appear viable).
Listening sockets may be retained, but any previous connection made through
them must be dropped.

Once all connections have stopped, the Follower should send a `reconnecting`
message, then start the connection process for the next generation, which
will send new `connection-hint` messages for all listening sockets.

Generations are non-overlapping. The Leader will drop all connections from
generation 1 before sending the `reconnect` for generation 2, and will not
initiate any gen-2 connections until it receives the matching `reconnecting`
from the Follower. The Follower must drop all gen-1 connections before it
sends the `reconnecting` response (even if it thinks they are still
functioning: if the Leader thought the gen-1 connection still worked, it
wouldn't have started gen-2).

(TODO: what about a follower->leader connection that was started before
start-dilation is received, and gets established on the Leader side after
start-dilation is sent? the follower will drop it after it receives
start-dilation, but meanwhile the leader may accept it as gen2)

(probably need to include the generation number in the handshake, or in the
derived key)

(TODO: reduce the number of round-trip stalls here, I've added too many)

Each side is in the "connecting" state (which encompasses both making
connection attempts and having an established connection) starting with the
receipt of a `please-dilate` message and a local `w.dilate()` call. The
Leader remains in that state until it abandons the connection and sends a
`reconnect` message, at which point it remains in the "flushing" state until
the Follower's `reconnecting` message is received. The Follower remains in
"connecting" until it receives `reconnect`, then it stays in "dropping" until
it finishes halting all outstanding connections, after which it sends
`reconnecting` and switches back to "connecting".

"Connection hints" are type/address/port records that tell the other side of
likely targets for L2 connections. Both sides will try to determine their
external IP addresses, listen on a TCP port, and advertise `(tcp,
external-IP, port)` as a connection hint. The Transit Relay is also used as a
(lower-priority) hint. These are sent in `connection-hint` records, which can
be sent any time after both sending and receiving a `please` record. Each
side will initiate connections upon receipt of the hints.

```
{ "type": "connection-hints",
  "hints": [ ... ]
}
```

Hints can arrive at any time. One side might immediately send hints that can
be computed quickly, then send additional hints later as they become
available. For example, it might enumerate the local network interfaces and
send hints for all of the LAN addresses first, then send port-forwarding
(UPnP) requests to the local router. When the forwarding is established
(providing an externally-visible IP address and port), it can send additional
hints for that new endpoint. If the other peer happens to be on the same LAN,
the local connection can be established without waiting for the router's
response.

### Connection Hint Format

Each member of the `hints` field describes a potential L2 connection target
endpoint, with an associated priority and a set of hints.

The priority is a number (a positive or negative float), where larger
numbers indicate that the client supplying that hint would prefer to use
this connection over others with lower numbers. Priority expresses a sense
of cost or performance. For example, the Transit Relay is lower priority
than a direct TCP connection, because it incurs a bandwidth cost (on the
relay operator), as well as adding latency.
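
For illustration (the hint structures here are simplified to
priority/label pairs, not the real wire format), connection attempts might be
ordered by descending priority:

```python
# Simplified hint descriptions: (priority, label). Higher priority is preferred.
hints = [
    (0.0, "tcp direct 192.168.1.5:4001"),
    (-1.0, "transit relay"),
    (0.5, "tcp direct 203.0.113.7:4001"),
]

# Try the most-preferred candidates first.
ordered = sorted(hints, key=lambda h: h[0], reverse=True)
assert [label for _, label in ordered] == [
    "tcp direct 203.0.113.7:4001",
    "tcp direct 192.168.1.5:4001",
    "transit relay",
]
```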

Each endpoint has a set of hints, because the same target might be reachable
by multiple hints. Once one hint succeeds, there is no point in using the
other hints.

TODO: think this through some more. What's the example of a single endpoint
reachable by multiple hints? Should each hint have its own priority, or just
each endpoint?

## L2 protocol

Upon ``connectionMade()``, both sides send their handshake message. The
Leader sends "Magic-Wormhole Dilation Handshake v1 Leader\n\n". The Follower
sends "Magic-Wormhole Dilation Handshake v1 Follower\n\n". This should
trigger an immediate error for most non-magic-wormhole listeners (e.g. HTTP
servers that were contacted by accident). If the wrong handshake is received,
the connection will be dropped. For debugging purposes, the node might want
to keep looking at data beyond the first incorrect character and log
a few hundred characters until the first newline.

Everything beyond that point is a Noise protocol message, which consists of a
4-byte big-endian length field, followed by the indicated number of bytes.
This uses the `NNpsk0` pattern with the Leader as the first party ("-> psk,
e" in the Noise spec), and the Follower as the second ("<- e, ee"). The
pre-shared-key is the "dilation key", which is statically derived from the
master PAKE key using HKDF. Each L2 connection uses the same dilation key,
but different ephemeral keys, so each gets a different session key.
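
The 4-byte big-endian length prefix can be sketched with `struct` (the Noise
ciphertext here is just placeholder bytes, not a real handshake message):

```python
import struct

def frame(noise_message: bytes) -> bytes:
    # 4-byte big-endian length field, followed by that many bytes.
    return struct.pack(">L", len(noise_message)) + noise_message

def unframe(data: bytes) -> bytes:
    (length,) = struct.unpack(">L", data[:4])
    assert len(data) == 4 + length
    return data[4:]

msg = b"\x00" * 48  # placeholder for an encrypted Noise handshake message
assert unframe(frame(msg)) == msg
assert frame(msg)[:4] == b"\x00\x00\x00\x30"  # length 48, big-endian
```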

The Leader sends the first message, which is a psk-encrypted ephemeral key.
The Follower sends the next message, its own psk-encrypted ephemeral key.
These two messages are known as "handshake messages" in the Noise protocol,
and must be processed in a specific order (the Leader must not accept the
Follower's message until it has generated its own). Noise allows handshake
messages to include a payload, but we do not use this feature.

All subsequent messages are known as "Noise transport messages", and use
independent channels for each direction, so they no longer have ordering
dependencies. Transport messages are encrypted by the shared key, in a form
that evolves as more messages are sent.

The Follower's first transport message is an empty packet, which we use as a
"key confirmation message" (KCM).

The Leader doesn't send a transport message right away: it waits to see the
Follower's KCM, which indicates this connection is viable (i.e. the Follower
used the same dilation key as the Leader, which means they both used the same
wormhole code).

The Leader delivers the now-viable protocol object to the L3 manager, which
will decide which connection to select. When some L2 connection is selected
to be the new L3, the Leader finally sends an empty KCM of its own over that
L2, to let the Follower know which connection has been selected. All other L2
connections (either viable or still in handshake) are dropped, and all other
connection attempts are cancelled. All listening sockets may or may not be
shut down (TODO: think about it).

After sending their KCM, the Follower will wait for either an empty KCM (at
which point the L2 connection is delivered to the Dilation manager as the new
L3), a disconnection, or an invalid message (which causes the connection to
be dropped). Other connections and/or listening sockets are stopped.

Internally, the L2Protocol object manages the Noise session itself. It knows
(via a constructor argument) whether it is on the Leader or Follower side,
which affects both the role it plays in the Noise pattern, and its reaction
to receiving the handshake message / ephemeral key (only the Follower
responds with an empty KCM message). After that, the L2Protocol notifies the
L3 object in three situations:

* the Noise session produces a valid decrypted frame (for the Leader, this
  includes the Follower's KCM, and thus indicates a viable candidate for
  connection selection)
* the Noise session reports a failed decryption
* the TCP session is lost

All notifications include a reference to the L2Protocol object (`self`). The
L3 object uses this reference to either close the connection (for errors or
when the selection process chooses someone else), to send the KCM message
(after selection, only for the Leader), or to send other L4 messages. The L3
object will retain a reference to the winning L2 object.

## L3 protocol

The L3 layer is responsible for connection selection, monitoring/keepalives,
and message (de)serialization. Framing is handled by L2, so the inbound L3
codepath receives single-message bytestrings, and delivers the same down to
L2 for encryption, framing, and transmission.

Connection selection takes place exclusively on the Leader side, and includes
the following:

* receipt of viable L2 connections from below (indicated by the first valid
  decrypted frame received for any given connection)
* expiration of a timer
* comparison of TBD quality/desirability/cost metrics of viable connections
* selection of winner
* instructions to losing connections to disconnect
* delivery of KCM message through winning connection
* retain reference to winning connection

On the Follower side, the L3 manager just waits for the first connection to
receive the Leader's KCM, at which point it is retained and all others are
dropped.

The L3 manager knows which "generation" of connection is being established.
Each generation uses a different dilation key (?), and is triggered by a new
set of L1 messages. Connections from one generation should not be confused
with those of a different generation.

Each time a new L3 connection is established, the L4 protocol is notified. It
will immediately send all the L4 messages waiting in its outbound queue.
The L3 protocol simply wraps these in Noise frames and sends them to the
other side.

The L3 manager monitors the viability of the current connection, and declares
it as lost when bidirectional traffic cannot be maintained. It uses PING and
PONG messages to detect this. These also serve to keep NAT entries alive,
since many firewalls will stop forwarding packets if they don't observe any
traffic for e.g. 5 minutes.

Our goals are:

* don't allow more than 30? seconds to pass without at least *some* data
  being sent along each side of the connection
* allow the Leader to detect silent connection loss within 60? seconds
* minimize overhead

We need both sides to:

* maintain a 30-second repeating timer
* set a flag each time we write to the connection
* each time the timer fires, if the flag was clear then send a PONG,
  otherwise clear the flag

In addition, the Leader must:

* run a 60-second repeating timer (ideally somewhat offset from the other)
* set a flag each time we receive data from the connection
* each time the timer fires, if the flag was clear then drop the connection,
  otherwise clear the flag
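
The send-side flag-and-timer logic above can be sketched as plain state
(driving `timer_fired()` from a real 30-second timer, e.g. Twisted's
`LoopingCall`, is omitted; the class name is made up for illustration):

```python
class Keepalive:
    """Tracks the 'wrote data recently' flag on one side of the connection."""
    def __init__(self, send_pong):
        self._send_pong = send_pong
        self._wrote = False

    def data_written(self):
        self._wrote = True

    def timer_fired(self):      # called every ~30 seconds
        if not self._wrote:
            self._send_pong()   # no recent traffic: send *some* data
        self._wrote = False

sent = []
ka = Keepalive(lambda: sent.append("PONG"))
ka.timer_fired()    # no writes since the last tick: emit a PONG
ka.data_written()
ka.timer_fired()    # real traffic happened: stay quiet
assert sent == ["PONG"]
```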

In the future, we might have L2 links that are less connection-oriented,
which might have a unidirectional failure mode, at which point we'll need to
monitor full roundtrips. To accomplish this, the Leader will send periodic
unconditional PINGs, and the Follower will respond with PONGs. If the
Leader->Follower connection is down, the PINGs won't arrive and no PONGs will
be produced. If the Follower->Leader direction has failed, the PONGs won't
arrive. The delivery of both will be delayed by actual data, so the timeouts
should be adjusted if we see regular data arriving.

If the connection is dropped before the wormhole is closed (either the other
end explicitly dropped it, we noticed a problem and told TCP to drop it, or
TCP noticed a problem itself), the Leader-side L3 manager will initiate a
reconnection attempt. This uses L1 to send new `DILATE-n` messages through
the mailbox server, along with new connection hints. Eventually this will
result in a new L3 connection being established.

Finally, L3 is responsible for message serialization and deserialization. L2
performs decryption and delivers plaintext frames to L3. Each frame starts
with a one-byte type indicator. The rest of the message depends upon the
type:

* 0x00 PING, 4-byte ping-id
* 0x01 PONG, 4-byte ping-id
* 0x02 OPEN, 4-byte subchannel-id, 4-byte seqnum
* 0x03 DATA, 4-byte subchannel-id, 4-byte seqnum, variable-length payload
* 0x04 CLOSE, 4-byte subchannel-id, 4-byte seqnum
* 0x05 ACK, 4-byte response-seqnum
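
A sketch of this framing (the constant names are chosen for readability, and
the payload bytes are arbitrary examples):

```python
import struct

T_OPEN, T_DATA, T_CLOSE, T_ACK = 0x02, 0x03, 0x04, 0x05

def encode_data(scid: bytes, seqnum: int, payload: bytes) -> bytes:
    # one-byte type, 4-byte subchannel-id, 4-byte big-endian seqnum, payload
    return bytes([T_DATA]) + scid + struct.pack(">L", seqnum) + payload

def decode(frame: bytes):
    t = frame[0]
    if t == T_ACK:
        return ("ACK", struct.unpack(">L", frame[1:5])[0])
    scid, (seqnum,) = frame[1:5], struct.unpack(">L", frame[5:9])
    name = {T_OPEN: "OPEN", T_DATA: "DATA", T_CLOSE: "CLOSE"}[t]
    return (name, scid, seqnum, frame[9:])

f = encode_data(b"\x00\x00\x00\x01", 7, b"hello")
assert decode(f) == ("DATA", b"\x00\x00\x00\x01", 7, b"hello")
```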

All seqnums are big-endian, and are provided by the L4 protocol. The other
fields are arbitrary and not interpreted as integers. The subchannel-ids must
be allocated by both sides without collision, but otherwise they are only
used to look up L5 objects for dispatch. The response-seqnum is always copied
from the OPEN/DATA/CLOSE packet being acknowledged.

L3 consumes the PING and PONG messages. Receiving any PING will provoke a
PONG in response, with a copy of the ping-id field. The 30-second timer will
produce unprovoked PONGs with a ping-id of all zeros. A future viability
protocol will use PINGs to test for roundtrip functionality.

All other messages (OPEN/DATA/CLOSE/ACK) are deserialized and delivered
"upstairs" to the L4 protocol handler.

The current L3 connection's `IProducer`/`IConsumer` interface is made
available to the L4 flow-control manager.

## L4 protocol

The L4 protocol manages a durable stream of OPEN/DATA/CLOSE/ACK messages.
Since each will be enclosed in a Noise frame before they pass to L3, they do
not need length fields or other framing.

Each OPEN/DATA/CLOSE has a sequence number, starting at 0, and monotonically
increasing by 1 for each message. Each direction has a separate number space.

The L4 manager maintains a double-ended queue of unacknowledged outbound
messages. Subchannel activity (opening, closing, sending data) causes
messages to be added to this queue. If an L3 connection is available, these
messages are also sent over that connection, but they remain in the queue in
case the connection is lost and they must be retransmitted on some future
replacement connection. Messages stay in the queue until they can be retired
by the receipt of an ACK with a matching response-sequence-number. This
provides reliable message delivery that survives the L3 connection being
replaced.
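
A minimal sketch of the outbound queue and ACK-based retirement (the message
representation is simplified to `(seqnum, frame)` pairs, and `l3` is just a
list standing in for a live connection):

```python
from collections import deque

class OutboundQueue:
    """Holds unacknowledged OPEN/DATA/CLOSE messages until they are ACKed."""
    def __init__(self):
        self._queue = deque()      # (seqnum, frame), in seqnum order

    def send(self, seqnum, frame, l3=None):
        self._queue.append((seqnum, frame))
        if l3 is not None:
            l3.append(frame)       # transmit now if a connection exists

    def ack_received(self, acked_seqnum):
        # An ACK for N retires every queued message with seqnum <= N.
        while self._queue and self._queue[0][0] <= acked_seqnum:
            self._queue.popleft()

    def resend_all(self, l3):
        # A replacement L3 connection re-sends everything still queued.
        for _, frame in self._queue:
            l3.append(frame)

q = OutboundQueue()
wire = []
for n, f in enumerate([b"OPEN", b"DATA1", b"DATA2"]):
    q.send(n, f, wire)
q.ack_received(0)          # retires seqnum 0
replay = []
q.resend_all(replay)       # new connection: re-send seqnums 1 and 2
assert replay == [b"DATA1", b"DATA2"]
```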

ACKs are not acked, nor do they have seqnums of their own. Each inbound side
remembers the highest ACK it has sent, and ignores incoming OPEN/DATA/CLOSE
messages with that sequence number or lower. This ensures in-order
at-most-once processing of OPEN/DATA/CLOSE messages.
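
The receive-side deduplication can be sketched as follows (simplified; the
`process` callback stands in for real message dispatch):

```python
class InboundDedup:
    """At-most-once, in-order processing of OPEN/DATA/CLOSE messages."""
    def __init__(self, process):
        self._process = process
        self._next_expected = 0    # everything below this was already ACKed

    def message_received(self, seqnum, frame, acks_out):
        if seqnum < self._next_expected:
            return                 # duplicate from a re-sent queue: ignore
        self._process(frame)
        acks_out.append(seqnum)    # send an ACK so the sender can retire it
        self._next_expected = seqnum + 1

seen, acks = [], []
d = InboundDedup(seen.append)
d.message_received(0, b"OPEN", acks)
d.message_received(0, b"OPEN", acks)   # retransmitted duplicate: dropped
d.message_received(1, b"DATA", acks)
assert seen == [b"OPEN", b"DATA"]
assert acks == [0, 1]
```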

Each inbound OPEN message causes a new L5 subchannel object to be created.
Subsequent DATA/CLOSE messages for the same subchannel-id are delivered to
that object.

Each time an L3 connection is established, the side will immediately send all
L4 messages waiting in the outbound queue. A future protocol might reduce
this duplication by including the highest received sequence number in the L1
PLEASE-DILATE message, which would effectively retire queued messages before
initiating the L2 connection process. On any given L3 connection, all
messages are sent in-order. The receipt of an ACK for seqnum `N` allows all
messages with `seqnum <= N` to be retired.

The L4 layer is also responsible for managing flow control among the L3
connection and the various L5 subchannels.

## L5 subchannels

The L5 layer consists of a collection of "subchannel" objects, a dispatcher,
and the endpoints that provide the Twisted-flavored API.

Other than the "control channel", all subchannels are created by a client
endpoint connection API. The side that calls this API is named the Initiator,
and the other side is named the Acceptor. Subchannels can be initiated in
either direction, independent of the Leader/Follower distinction. For a
typical file-transfer application, the subchannel would be initiated by the
side seeking to send a file.

Each subchannel uses a distinct subchannel-id, which is a four-byte
identifier. Both directions share a number space (unlike L4 seqnums), so the
rule is that the Leader side sets the last bit of the last byte to a 1,
while the Follower sets it to a 0. These are not generally treated as
integers; however, for the sake of debugging, the implementation generates
them with a simple big-endian-encoded counter (`counter*2+1` for the Leader,
`counter*2+2` for the Follower, with id `0` reserved for the control
channel).
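
A sketch of that allocation rule (the function name is made up for
illustration):

```python
import struct

def allocate_subchannel_id(counter: int, is_leader: bool) -> bytes:
    # Big-endian four-byte id; the Leader's ids end in a 1 bit, the
    # Follower's in a 0 bit, and id 0 is reserved for the control channel.
    n = counter * 2 + 1 if is_leader else counter * 2 + 2
    return struct.pack(">L", n)

assert allocate_subchannel_id(0, is_leader=True) == b"\x00\x00\x00\x01"
assert allocate_subchannel_id(0, is_leader=False) == b"\x00\x00\x00\x02"
# The two sides can never collide: Leader ids are odd, Follower ids even.
assert allocate_subchannel_id(5, True)[-1] & 1 == 1
assert allocate_subchannel_id(5, False)[-1] & 1 == 0
```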

When the `client_ep.connect()` API is called, the Initiator allocates a
subchannel-id and sends an OPEN. It can then immediately send DATA messages
with the outbound data (there is no special response to an OPEN, so there is
no need to wait). The Acceptor will trigger their `.connectionMade` handler
upon receipt of the OPEN.

Subchannels are durable: they do not close until one side calls
`.loseConnection` on the subchannel object (or the enclosing Wormhole is
closed). Either the Initiator or the Acceptor can call `.loseConnection`.
This causes a CLOSE message to be sent (with the subchannel-id). The other
side will send its own CLOSE message in response. Each side will signal the
`.connectionLost()` event upon receipt of a CLOSE.

There is no equivalent to TCP's "half-closed" state; however, if only one
side calls `close()`, then all data written before that call will be
delivered before the other side observes `.connectionLost()`. Any inbound
data that was queued for delivery before the other side sees the CLOSE will
still be delivered to the side that called `close()` before it sees
`.connectionLost()`. Internally, the side which called `.loseConnection`
will remain in a special "closing" state until the CLOSE response arrives,
during which time DATA payloads are still delivered. After calling
`close()` (or receiving CLOSE), any outbound `.write()` calls will trigger
an error.

DATA payloads that arrive for a non-open subchannel are logged and discarded.

This protocol calls for one OPEN and two CLOSE messages for each subchannel,
with some arbitrary number of DATA messages in between. Subchannel-ids should
not be reused (it would probably work, but the protocol hasn't been analyzed
enough to be sure).

The "control channel" is special. It uses a subchannel-id of all zeros, and
is opened implicitly by both sides as soon as the first L3 connection is
selected. It is routed to a special client-on-both-sides endpoint, rather
than causing the listening endpoint to accept a new connection. This avoids
the need for application-level code to negotiate who should be the one to
open it (the Leader/Follower distinction is private to the Wormhole
internals: applications are not obligated to pick a side).

OPEN and CLOSE messages for the control channel are logged and discarded. The
control-channel client endpoint can only be used once, and does not close
until the Wormhole itself is closed.

Each OPEN/DATA/CLOSE message is delivered to the L4 object for queueing,
delivery, and eventual retirement. The L5 layer does not keep track of old
messages.

### Flow Control

Subchannels are flow-controlled by pausing their writes when the L3
connection is paused, and pausing the L3 connection when the subchannel
signals a pause. When the outbound L3 connection is full, *all* subchannels
are paused. Likewise the inbound connection is paused if *any* of the
subchannels asks for a pause. This is much easier to implement and improves
our utilization factor (we can use TCP's window-filling algorithm, instead of
rolling our own), but will block all subchannels even if only one of them
gets full. This shouldn't matter for many applications, but might be
noticeable when combining very different kinds of traffic (e.g. a chat
conversation sharing a wormhole with file-transfer might prefer the IM text
to take priority).

Each subchannel implements Twisted's `ITransport`, `IProducer`, and
`IConsumer` interfaces. The Endpoint API causes a new `IProtocol` object to
be created (by the caller's factory) and glued to the subchannel object in
the `.transport` property, as is standard in Twisted-based applications.

All subchannels are also paused when the L3 connection is lost, and are
unpaused when a new replacement connection is selected.