Advanced topics
When are events published?
Execution events are recorded roughly "as they are happening" inside the execution daemon: you see a BLOCK_START event at roughly the same moment that the execution daemon begins processing a new block, followed by the start of the first transaction (a TXN_HEADER_START event) about 1 millisecond later. Most transaction-related events are recorded less than one microsecond after the transaction they describe has completed.
Execution of a typical transaction will emit a few dozen events, but a large transaction can emit hundreds of events. The TXN_EVM_OUTPUT event -- which is recorded as soon as the transaction is finished -- provides a summary accounting of how many more events related to that transaction will follow (how many logs, how many call frames, etc.), so that any memory can be preallocated. Such an event is referred to as a "header event" in the documentation: an event whose content describes the number of subsequent, related events that will be recorded.
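For illustration, here is a minimal C sketch of how a reader might use the counts carried by a header event to preallocate storage. The struct layouts and field names are hypothetical, not the real event definitions; they only stand in for the idea that the header event announces how much follows.

```c
/* Minimal sketch (hypothetical struct layouts, not the real event
 * definitions): using the counts carried by a header event such as
 * TXN_EVM_OUTPUT to preallocate storage for the related events that follow. */
#include <stdint.h>
#include <stdlib.h>

struct log_record  { uint8_t data[64]; };    /* placeholder record types */
struct call_record { uint8_t data[64]; };

struct txn_evm_output {          /* hypothetical header-event payload */
    uint32_t log_count;          /* number of log events to follow        */
    uint32_t call_frame_count;   /* number of call-frame events to follow */
};

struct txn_accumulator {
    struct log_record  *logs;
    struct call_record *call_frames;
    uint32_t log_count;
    uint32_t call_frame_count;
};

/* Allocate everything once, up front, because the header event tells us
 * exactly how many related events will be recorded. */
static int txn_accumulator_init(struct txn_accumulator *acc,
                                const struct txn_evm_output *hdr)
{
    acc->logs        = calloc(hdr->log_count,        sizeof *acc->logs);
    acc->call_frames = calloc(hdr->call_frame_count, sizeof *acc->call_frames);
    if ((hdr->log_count && !acc->logs) ||
        (hdr->call_frame_count && !acc->call_frames)) {
        free(acc->logs);
        free(acc->call_frames);
        return -1;
    }
    acc->log_count        = hdr->log_count;
    acc->call_frame_count = hdr->call_frame_count;
    return 0;
}
```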
All these events are recorded as soon as the transaction is "committed" to the currently-executing block. This happens before the block has finished executing, and should not be confused with the unrelated notion of "commitment" in the consensus algorithm. Although there are complex speculative execution optimizations inside the execution daemon, the recording of a transaction takes place when all work on a particular transaction has finished. This is referred to as "transaction commit" time.
This is different from the block-at-a-time style of update you would see in, for example, the Geth real-time events WebSocket protocol (which our RPC server also supports). Certain properties of the block (its hash, its state root, etc.) are not known at the time you see the transactions. If you would like block-at-a-time updates, the Rust SDK contains some utilities which will aggregate the events back into complete, block-oriented updates.
One thing to be careful of: although transactions are always committed to a block in index order, they might be recorded out of order. That is, you must assume that the set of execution events that make up transactions 2 and 3 could be "mixed together" in any order. This is because of optimizations in the event recording code path.
However, for a particular transaction (e.g., transaction 3), the events pertaining to that transaction are always recorded in the same order: first all of the logs, then all of the call frames, then all of the state access records. Each of these is recorded in index order, i.e., log 2 is always recorded before log 3.
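As a rough illustration of these ordering rules, the C sketch below routes events into per-transaction buckets: events from different transactions may interleave, but appending each transaction's events in arrival order preserves their per-transaction order. The types, field names, and bounds are hypothetical, not the real API.

```c
/* Rough sketch (hypothetical types, not the real API): routing events into
 * per-transaction buckets.  Events belonging to different transactions may
 * interleave, but events within a single transaction arrive in a fixed
 * order, so appending them in arrival order preserves that order. */
#include <stddef.h>
#include <stdint.h>

#define MAX_TXNS_PER_BLOCK 256   /* illustrative bounds only */
#define MAX_EVENTS_PER_TXN 512

struct raw_event {
    uint32_t txn_index;          /* which transaction this event belongs to */
    uint16_t event_type;
    /* ... reference to the payload, etc. ... */
};

struct txn_bucket {
    struct raw_event events[MAX_EVENTS_PER_TXN];
    size_t count;
};

static struct txn_bucket buckets[MAX_TXNS_PER_BLOCK];

/* Called for each event, in the order events are read from the ring. */
static void route_event(const struct raw_event *ev)
{
    if (ev->txn_index >= MAX_TXNS_PER_BLOCK)
        return;                          /* out of range for this sketch */
    struct txn_bucket *b = &buckets[ev->txn_index];
    if (b->count < MAX_EVENTS_PER_TXN)
        b->events[b->count++] = *ev;     /* per-transaction order preserved */
}
```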
Sequence numbers and the lifetime detection algorithm
All event descriptors are tagged with an incrementing sequence number starting at 1. Sequence numbers are 64-bit unsigned integers which do not repeat unless the execution daemon is restarted. Zero is not a valid sequence number.
Also note that the sequence number modulo the descriptor array size equals the array index where the next event descriptor will be located. This is shown below with a concrete example where the descriptor array size is 64. Note that the last valid index in the array is 63, after which access wraps around to the beginning of the array at index 0.
```
╔═...══════════ Event descriptor array ═══════════════╬══════════════...═╗
║ ┌─Event────────┐┌─Event────────┐┌─Event────────┐    │ ┌─Event────────┐ ║
║ │              ││              ││              │    │ │              │ ║
║ │ seqnum = 318 ││ seqnum = 319 ││ seqnum = 320 │    │ │ seqnum = 256 │ ║
║ │              ││              ││              │    │ │              │ ║
║ └──────────────┘└──▲───────────┘└──────────────┘    │ └──────────────┘ ║
║        61          │    62             63           │        0         ║
╚═...════════════════╬════════════════════════════════╬══════════════...═╝
                     │                                │
                Next event                Ring buffer wrap-around
                                          to zero is here

┌──────────────────────────────┐
│ last read sequence number    │
│ (last_seqno) is initially 318│
└──────────────────────────────┘
```
In this example:
- We keep track of the "last seen sequence number" (last_seqno), which has the value 318 to start; being the "last" sequence number means we have already finished reading the event with this sequence number, which lives at array index 61
- 318 % 64 is 62, so we will find the potential next event at that index if it has been produced
- Observe that the sequence number of the item at index 62 is 319, which is the last seen sequence number plus 1 (319 == 318 + 1). This means that event 319 has been produced, and its data can be safely read from that slot
- When we're ready to advance to the next event, the last seen sequence number will be incremented to 319. As before, we can find the next event (if it has been produced) at 319 % 64 == 63. The event at this index bears the sequence number 320, which is again the last seen sequence number + 1, therefore this event is also valid
- When advancing a second time, we increment the last seen sequence number to 320. This time, the sequence number of the event at index 320 % 64 == 0 is not 321, but a smaller number, 256. This means the next event has not been written yet, and we are seeing an older event in the same slot. We've seen all of the currently available events, and will need to check again later once a new event is written
- Alternatively, we might have seen a much larger sequence number, like 384 (320 + 64). This would mean that we consumed events too slowly, so slowly that the 63 events in the range [321, 384) were produced in the meantime. These were subsequently overwritten, and are now lost. They can be replayed using services external to the event ring API, but within the event ring API itself there is no way to recover them
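The reader API implements this check for you, but as a rough illustration, the decision logic looks something like the following sketch. The type, constant, and function names are hypothetical, and a real implementation would use atomic loads rather than plain reads.

```c
/* Sketch of the advance/validity check described above (hypothetical types;
 * the real reader API provides this logic for you). */
#include <stdint.h>

#define DESCRIPTOR_COUNT 64       /* descriptor array size used in the example */

struct event_descriptor {
    uint64_t seqnum;              /* written by the producer */
    /* ... event type, payload offset, payload length, etc. ... */
};

enum poll_result { EVENT_READY, EVENT_NOT_READY, EVENT_GAP };

/* Inspect the slot that should hold event `last_seqno + 1`. */
static enum poll_result poll_next(const struct event_descriptor *ring,
                                  uint64_t last_seqno,
                                  const struct event_descriptor **out)
{
    const struct event_descriptor *slot = &ring[last_seqno % DESCRIPTOR_COUNT];
    uint64_t found = slot->seqnum;    /* real code must use an atomic load */

    if (found == last_seqno + 1) {    /* the next event has been produced */
        *out = slot;
        return EVENT_READY;
    }
    if (found <= last_seqno)          /* older event still occupies the slot */
        return EVENT_NOT_READY;
    return EVENT_GAP;                 /* writer lapped us; events were lost */
}
```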
Lifetime of an event payload, zero copy vs. memcpy APIs
Because of the descriptor overwrite behavior, an event descriptor might be overwritten by the execution daemon while a reader is still examining its data. To deal with this, the reader API makes a copy of the event descriptor. If it detects that the event descriptor changed during the copy operation, it reports a gap. Copying an event descriptor is fast, because it is only a single cache line in size.
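As an illustration, the copy-and-verify step looks roughly like the sketch below. The types and names are hypothetical; the real reader API does this internally, with the appropriate atomics and memory barriers.

```c
/* Sketch of the "copy, then verify" step for descriptors (hypothetical
 * types; the real reader API implements this internally). */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct event_descriptor {
    uint64_t seqnum;
    /* ... remainder of the cache-line-sized descriptor ... */
};

/* Copy the descriptor expected to hold `want_seqno`.  If the slot's
 * sequence number no longer matches after the copy, the writer may have
 * reused the slot mid-copy, so the result is conservatively reported as
 * a gap. */
static bool copy_descriptor(const struct event_descriptor *slot,
                            uint64_t want_seqno,
                            struct event_descriptor *out)
{
    memcpy(out, slot, sizeof *out);   /* one cache line: cheap to copy */
    return out->seqnum == want_seqno && slot->seqnum == want_seqno;
}
```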
This is not the case for event payloads, which could potentially be very large. This means a memcpy(3) of an event payload could be expensive, and it would be advantageous to read the payload bytes directly from the payload buffer's shared memory segment: a "zero-copy" API. This exposes the user to the possibility that the event payload could be overwritten while they are still using it, so two solutions are provided:
- A simple detection mechanism allows payload overwrite to be detected at any time: the writer keeps track of the minimum payload offset value (before modular arithmetic is applied) that is still valid. If the offset value in the event descriptor is smaller than this, it is no longer safe to read the event payload
- A payload memcpy-style API is also provided. This uses the detection mechanism above in the following way: first, the payload is copied to a user-provided buffer. Before returning, the API checks whether the lifetime remained valid after the copy finished. If so, then an overwrite did not occur during the copy, so the copy must be valid. Otherwise, the copy is invalid
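A rough sketch of how such a memcpy-style read could be layered on the overwrite check is shown below. All names are hypothetical, not the real API, and the sketch assumes the payload does not wrap around the end of the buffer.

```c
/* Sketch of a memcpy-style payload read layered on the overwrite check
 * (all names hypothetical, not the real API).  Assumes the payload does not
 * wrap around the end of the buffer. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct payload_ring {
    const uint8_t *base;            /* shared-memory payload buffer         */
    uint64_t size;                  /* power-of-two buffer size             */
    const volatile uint64_t *min_valid_offset;  /* maintained by the writer */
};

/* True if a payload whose descriptor records `offset` is still safe to read. */
static bool payload_still_valid(const struct payload_ring *r, uint64_t offset)
{
    return offset >= *r->min_valid_offset;
}

/* Copy first, then check: if the payload was overwritten during the copy,
 * the destination buffer contains garbage and must be discarded. */
static bool copy_payload(const struct payload_ring *r, uint64_t offset,
                         uint64_t length, void *dst)
{
    memcpy(dst, r->base + (offset & (r->size - 1)), length);
    return payload_still_valid(r, offset);
}
```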
The reason to prefer the zero-copy APIs is that they do less work. The reason to prefer the memcpy APIs is that it is not always easy (or possible) to "undo" work you have already done if you find out later that the event payload was corrupted by an overwrite while you were working with it. The most logical thing to do in that case is to start by copying the data to a stable location and, if the copy isn't valid, never start the operation at all.
An example user of the zero-copy API is the eventwatch example C program, which can turn events into printed strings that are sent to stdout. The expensive work of formatting a hexdump of the event payload is performed using the original payload memory. If an overwrite happened during the string formatting, the hexdump output buffer will be wrong, but that is OK: it will not be sent to stdout until the end. Once formatting is complete, eventwatch checks if the payload expired and, if so, writes an error to stderr instead of writing the formatted buffer to stdout.
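A condensed sketch of that pattern follows. It is not the actual eventwatch source; the helper names are hypothetical, and `payload_still_valid` stands in for the overwrite check sketched earlier.

```c
/* Condensed sketch of the eventwatch pattern (not the actual eventwatch
 * source; all helper names are hypothetical).  The expensive formatting is
 * done directly from the shared-memory payload, and the result is only
 * committed to stdout if the payload was still valid afterwards. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the overwrite check sketched earlier. */
static bool payload_still_valid(uint64_t payload_offset)
{
    (void)payload_offset;
    return true;
}

static void print_event_hexdump(const uint8_t *payload, uint64_t length,
                                uint64_t payload_offset)
{
    char line[80];
    char out[16 * 1024];
    size_t used = 0;

    /* Expensive work, performed on the live payload: if the writer
     * overwrites it now, `out` will contain garbage. */
    for (uint64_t i = 0; i < length && used + sizeof line <= sizeof out; i += 16) {
        int n = snprintf(line, sizeof line, "%08llx:", (unsigned long long)i);
        for (uint64_t j = i; j < i + 16 && j < length; j++)
            n += snprintf(line + n, sizeof line - (size_t)n, " %02x", payload[j]);
        n += snprintf(line + n, sizeof line - (size_t)n, "\n");
        memcpy(out + used, line, (size_t)n);
        used += (size_t)n;
    }

    /* Only now decide: garbage is never sent to stdout. */
    if (payload_still_valid(payload_offset))
        fwrite(out, 1, used, stdout);
    else
        fprintf(stderr, "event payload expired during formatting\n");
}
```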
Whether you should copy or not depends on the characteristics of the reader, namely how easily it can deal with "aborting" processing.