gstreamer/docs/design/time_notes.md
Henry Wilkes e142f49177 docs: add some notes on Time in GES
These notes cover time coordinates in GES, time effects, time
translations.

It also goes into why keyframes will not work with non-linear time
effects.

Part-of: <https://gitlab.freedesktop.org/gstreamer/gst-editing-services/-/merge_requests/177>
2020-05-25 11:20:38 +01:00

24 KiB

Time Notes

Some notes on time coordinates and time effects in GES.

Time Coordinate Definitions

A timeline will have a single time coordinate, which runs from 0 onwards in GstClockTime. Each track in the timeline will share the same time.

For a given track, at any given timeline time time, we have a stack of GESTrackElements whose interval [start, start + duration] contains time. The elements are linked in order or priority. Each element will have four time coordinates per each unique stack it is part of:

  • external sink coordinates: the coordinates used at the boundary between the upstream element and itself. This is the external source coordinates of the upstream element minus (downstream-start - upstream-start). If it has no upstream element, these coordinates do not exist.
  • external source coordinates: the coordinates used at the boundary between the downstream element and itself. This is the external sink coordinates of the downstream element minus (upstream-start - downstream-start). If it has no downstream element, these coordinates can be translated to the timeline coordinates by adding the start of the element.
  • internal sink coordinates: the coordinates used for the sink of the first internal GstElement. This is the external sink coordinates plus in-point.
  • internal source coordinates: the coordinates used at the source of the last internal GstElement. This will differ from the internal sink coordinates if one of the GstElements applies a rate-changing effects. This is the external source coordinates plus in-point. Note that an element that changes the consumption rate should always have its in-point set to 0. This is because nleghostpad is not able to 'undo' this shift by in-point at the opposite pad.

The following diagram shows where these coordinates are used, and how they are transformed. Below we have a GESSource, followed by a GESOperation that does not perform any rate-changing effect, followed by a GESEffect that does apply a rate-changing effect (and so its in-point is 0).

time                                        coordinate             coordinate
coords                  object              transformation         transformation
used                                        upwards                downwards
______________________________________________________________________________________

            |        source (1)         |
int. src    |---------------------------|
            |                           |   + in-point-1           - in-point-1
ex. src     '==========================='
                                            - start-1 + start-2    + start-1 - start-2
ex. sink    .===========================.
            |                           |   - in-point-2           + in-point-2
int. sink   |---------------------------|
            |       operation (2)       |   identity ()            identity ()
int. src    |---------------------------|
            |                           |   + in-point-2           - in-point-2
ex. src     '==========================='
                                            - start-2 + start-3    + start-2 - start-3
ex. sink    .===========================.
            |                           |   - 0                    + 0
int. sink   |---------------------------|
            |     time effect (3)       |   f ()                   f^-1 ()
int. src    |---------------------------|
            |                           |   + 0                    - 0
ex. src     '==========================='
                                            - start-3              + start-3
timeline    +++++++++++++++++++++++++++++
                      timeline

The given function f summarises how a seek will transform as it goes from from the source to the sink of the internal GstElement, and f^-1 summarises how a segment stream time will transform.

In particular, f will be a function

f: [0, MAX] -> [0, G_MAXUINT64 - 1]

where MAX is some guint64. For what follows, we will only fully support time effects whose function f:

  • is monotonically increasing. This would exclude time effects that play some later content, and then jump back to earlier content.
  • is 'continuously reversible'. We define T_d for the time t as the set of all the times t' such that |f (t') - t| <= d. This property requires that, for any t between f's minimum and maximum values, we can choose a small d such that T_d is not empty, is small and has no gaps. The word "small" refers to an unnoticeable difference (the times are in nanoseconds). This means that f can be approximately reversed at all points between its minimum and maximum, which means that f^-1 can act as a close inverse of f. For a monotonically increasing function, this means that f is steadily increasing.

For example, if f simply doubles the time, then for time t = 501, we can choose d=1, and T_d would be {250, 251}.

This would exclude a time effect which has a large jump, because there would be a time t between this jump, whose T_d would be empty for all small d.

This would also exclude a time effect that creates a freeze-frame effect by always seeking to the same spot, because at the time t of this freeze-frame, T_d would be large for all d.

  • obeys f (0) = 0. This would exclude a time effect that introduces an initial shift in the requested source time.
  • has a MAX that is large enough. For example, 24 hours would be fine for a timeline. This would exclude a rate effect with a very large speedup.
  • does not depend on any property outside of the effect element, or on the data it receives. This would exclude a time effect that, say, goes faster if there is more red in the image.

In what follows, a time effect that breaks one of these can still be used, but not all the features will work.

Translations Handled by nleobject Pads

An nleobject source pad will translate outgoing times by applying

time-out = time-in + start - in-point

This will translate from the internal source coordinates to the timeline coordinates if it is the most downstream element. Similarly, an nleobject sink pad will translate incoming times by applying

time-out = time-in - start + in-point

If we have two nle-objects, object-up and object-down, that have their pads linked, then a time time-up from object-up's internal GstElement, would be translated at the link to

time-down
 = time-up + start-up - in-point-up - start-down + in-point-down

So the pads will overall translate from the internal source coordinates of the upstream element to the internal sink coordinates of the downstream element.

Undefined Translations

Note that the coordinate transformation from the timeline time to an upstream time may be undefined depending on the configuration of elements in the timeline. For example, consider the earlier example stack, with the operation starting later than the time effect, such that

d = (start-2 - start-3) > 0

And we choose the time

time = start-2
     = start-3 + d

Then, when time is transformed to the external source coordinates of the operation, we have

operation-source-time = f (time - start-3) - start-2 + start-3
                      = f (d) - d

If the time effect slows down the consumption rate, then f (d) < d, which would make the time undefined in the external source coordinates (we can not have a negative GstClockTime). Basically, the effect is trying to access content that is before the operation.

We can similarly have an effect that tries to access content that is later than the operation, but this wouldn't lead to an underflow of the time. It can however lead to a request for data that is outside the internal content of the operation.

Mismatched Coordinates

The coordinates of an element are only defined relative to the stack that they are in. However, if we have no time effects, these coordinates will line up. Consider the following source and operation configuration.

            |        source (1)         |
            |---------------------------|
            |                           |
            '==========================='

.===========================.
|                           |
|---------------------------|
|       operation (2)       |
|---------------------------|
|                           |
'==========================='

+++++++++++++++++++++++++++++++++++++++++
              timeline

This gives us three stacks

                          |      (1)      |            |    (1)    |
                          |---------------|            |-----------|
                          |               |            |           |
                          '===============.            '==========='
                          0           (s1-s2+d1)   (s1-s2+d1)      d2

 0        (s2-s1)      (s2-s1)           d1
 .===========.            .===============.
 |           |            |               |
 |-----------|            |---------------|
 |    (2)    |            |      (2)      |
 |-----------|            |---------------|
 |           |            |               |
 '==========='            '==============='
 0        (s2-s1)      (s2-s1)           d1

 +++++++++++++            +++++++++++++++++            +++++++++++++
s1          s2           s2            (s1+d1)      (s1+d1)     (s2+d2)

where we have written in times in the external coordinates of the elements, where s1 and d1 are the start and duration of the source, and similarly for s2 and d2 for the operation. We can see that the edge times of all coordinates match up with their neighbours. Therefore, for both elements, there coordinates across each stack can be combined into a single coordinate system.

Consider that instead of the operation we have a time effect, then we would have

                          |      (1)      |            |    (1)    |
                          |---------------|            |-----------|
                          |               |            |           |
                          '===============.            '==========='
                      f(s2-s1)          f(d1)      (s1-s2+d1)      d2
                      -(s2-s1)         -(s2-s1)

f(0)     f(s2-s1)     f(s2-s1)          f(d1)
 .===========.            .===============.
 |           |            |               |
 |-----------|            |---------------|
 |    (2)    |            |      (2)      |
 |-----------|            |---------------|
 |           |            |               |
 '==========='            '==============='
 0        (s2-s1)      (s2-s1)           d1

 +++++++++++++            +++++++++++++++++            +++++++++++++
s1          s2           s2            (s1+d1)      (s1+d1)     (s2+d2)

We can see that the coordinates of the source now start at f(s2-s1) - (s2-s1), rather than 0. We can also see that the external source coordinates of the source jump by (d1 - f(d1)) when the time effect ends. Therefore, most time effects will prevent the coordinates from different stacks from being combined. This can lead to counter-intuitive behaviour.

A further example would be a rate effect with rate=3 that covers two sources that are side by side. The rate effect will not treat this as playing the sources concatenated, at triple speed. Instead, it would play the first source at triple speed, and once it reaches the starting timeline time of the second source, it will start playing the second source instead, but starting from the internal source coordinates

3 * (source-start - rate-start) - (source-start - rate-start) + source-in-point
 = 2 * (source-start - rate-start) + source-in-point

Note that if this was a slowed down rate, this would have been an undefined (negative) time, as we mentioned earlier.

Therefore, in general, time effects should only be placed at a higher priority than elements that share the same start and duration as it. Note that it is fine to place an operation with a higher priority on top of a time effect with a different start or duration because this will not lead to a change in the coordinates.

This is why only a GESSourceClip can have time effects added to it.

There is a general exception to this: if a time effect obeys f (0) = 0, then it will not introduce mismatched coordinates downstream if it has a later start than all the elements it has a higher priority than, and its end timeline time matches all of theirs. Note that this is because the effect would only exist in a single stack, and starts by apply no change to the times it receives.

GESTimelineElement times

The start and duration of an element use the timeline time coordinates. in-point and max-duration use the internal source coordinates. These last two should be 0 and GST_CLOCK_TIME_NONE respectively for time effects.

How to Translate Between Time Coordinates of a Clip

Consider a GESTrackElement element in a GESClip clip in a timeline. It has n active elements with higher priority in the same clip and track, labelled by i=1,...,n, where element 1 has a higher priority than element 2, and so on. Each element has an associated function f_i that translates from its external source coordinates to its external sink coordinates. Note that for elements that apply no time effect, this will be an identity, regardless of their in-point. We can define the function F, such that

F(t) = f_n (f_n-1 ( ... f_1 (t)...))

Note that if each f_i has the desired properties, then so will F, with the exception that the maximum value it can translate may have become too small. For example, if several rate effects accumulate into a very large speedup.

Given such an F, we can translate from the timeline time t to the internal source coordinate time of element using

F (t - start) + in-point

This is what is done in ges_clip_get_internal_time_from_timeline_time.

Note that this works because all the elements in clip share the same start. Note that this would not work if there existed an overlapping higher priority time effect outside of the clip because the highest priority clip element would not be receiving a timeline time at its source pads. This is not a problem if there are non-time effects at higher priority because they will pass through a timeline time unchanged.

If F has the desired properties, it will have a well defined inverse F^-1, based on the inverses of f_i, which we can use to reverse this translation:

F^-1 (t - in-point) + start

This is what is done in ges_clip_get_timeline_time_from_internal_time.

duration-limit

The duration-limit is meant to be the largest value we can set the clip's duration to.

It would be given by the minimum

ges_clip_get_timeline_time_from_internal_time (clip, child, child-max-duration) - start

we calculate amongst all its children that have a max-duration. Note that the implementation of _calculate_duration_limit does not use this method directly, but it should give the same result.

Note that this would fail if max-duration is not reachable through a seek. E.g. if the corresponding function F of the time effects acted like

F (t) = t + max-duration + 1

then F^-1 (t) will be undefined for t=max-duartion because its domain will be [max-duration + 1, inf). Note that this function F does not obey F (0)=0, so is not supported in GES.

Note that duration-limit may not be exactly the largest end time possible. If the corresponding function F is monotonically increasing, then there is no source time below max-duration that could give a larger value, but there may be some times beyond max-time that would correspond to the same source time. However, these extra times will only differ from the max-time by a small amount if F is 'continuously reversible', and so max-time would be close enough. Otherwise, we would not have a simple way to know which is the actual largest duration.

Trimming a clip

Normally, trimming is meant to keep the internal content in the same position relative to the timeline. If we are applying a non-constant rate effect, it may not be possible to keep all the internal content appearing in the timeline at the same time whilst changing the start and duration. However, we can keep the start or end frames/samples in the same timeline position.

Trimming the start of a clip to a later time

When trimming the start edge of a clip from timeline time old-start to new-start, where old-start < new-start <= (old-start + duration), we set the in-point of the clip's children such that the internal content that appeared at new-start before the trim, still appears at new-start afterwards.

This would require

new-in-point = old-in-point + F (new-start - old-start)

because this is the internal source time corresponding to new-start.

Note that, after we have finished trimming, assuming the corresponding F has not changed and F (0) = 0,

ges_clip_get_internal_timeline_from_timeline_time (clip, child, new-start)
 = F (new-start - new-start) + new-in-point
 = new-in-point

So after trimming, new-start will correspond to the same source position as before. Note that this would not work if the time effects changed depending on the data they receive (such as a "go faster if we have more red" time effect) because the corresponding F would have changed after setting the in-point. However, we already stated earlier that these are not supported in GES.

Trimming the start of a clip to an earlier time

When trimming the start edge of a clip from timeline time old-start to new-start, where new-start < old-start, we set the in-point of the clip's children such that the internal content that appeared at old-start before the trim, still appears at old-start afterwards.

new-in-point = old-in-point - F (old-start - new-start)

Note that this will fail if the second argument is too big, which indicates that it would be before there is any internal content.

In terms of the function F earlier, since this is calculated using the new start and old in-point, the source-time would be

Note, after we have finished trimming, assuming the corresponding F has not changed,

ges_clip_get_internal_time_from_timeline_time (clip, child, old-start)
 = F (old-start - new-start) + new-in-point
 = F (old-start - new-start) + old-in-point - F (old-start - new-start)
 = old-in-point

So after trimming, old-start will correspond to the same source time as before.

Note that ges_clip_get_internal_time_from_timeline_time will perform this same calculation if it receives a timeline time before the start of the clip. So timeline-tree is simply able to call ges_clip_get_internal_time_from_timeline_time (clip, child, new_start, error) in both cases.

Trimming the end of a clip

This is as simple as changing the duration of the clip since everything will stay at the same timeline position anyway (assuming F does not change, as required by GES). It just cannot go above the clip's duration-limit.

Splitting a clip

The in-point of the new clip is chosen to match the new out-point of the split clip. This won't work well if different core children of the clip will end up with very different out-points. But if these differences are within half a source frame, GES will not complain. The same can happen when trimming a clip, since all the core children must share the same in-point.

Buffer Timestamps

NOTE: As of 21 May, the recommended changes are not implemented in GES. This delves into why translations will be needed for non-linear time effects.

Currently, the nleobject pads will leave the buffer times unchanged. Which means that an nlesource will send out buffer timestamped using its internal source coordinates.

This is in contrast to a pitch within an nleoperation, which would translate the buffer times from its internal sink coordinates to its internal source coordinates.

Since the internal source coordinates of the nlesource do not match the internal sink coordinates of the nleoperation (they will differ by in-point), this will result in buffer times that are not in any coordinates.

This will make it difficult to use control bindings which are to be given in stream time, which is linked to the buffer timestamps.

We can explore this in more detail. According to the GStreamer design docs, the stream time is used for

  • report the POSITION query in the pipeline
  • the position used in seek events/queries
  • the position used to synchronize controller values

Therefore, in our case, we can say that the stream time at a given position in an nle stack should match the corresponding seek time.

If we have no applied rate, which shouldn't be the case for a normal uses of a timeline, the stream-time is given by

    stream-time = buffer.timestamp - seg.start + seg.time

Thus, the stream time is basically the internal source or sink coordinates. In GESTrackElement control sources are meant to be given in the internal source coordinates.

We will now look at what these time values currently are set to in a stack of an nlesource and an nleoperation that share the same start and duration.

Currently, the nleobject pads will only change the seg.time of the segments it receives by adding or subtracting (start - in-point).

We will assume that the GstElement that the nleoperation wraps is applying its time effect to seg.time, seg.start and buffer.timestamp, which is given by the function g. Note that this is what pitch currently does. Its not clear to me what videorate does to the buffer.timestamp, but it does transform seg.start the same way as seg.time.

The following is a table of what the seg.time, seg.start and buffer.timestamp values are when leaving a pad. The "internal src pad" refers to the source pad of the internal GstElement. s is the start of the objects, and i is the in-point of the nlesource. Following these is what the corresponding stream time would be using these values. The final row is what the corresponding seek position would be coming into the pad, if were seeking to the same media time T.

           nlesrc       nlesrc       nleop        nleop        nleop
           internal     external     external     internal     external
           src pad      src pad      sink pad     src pad      src pad

seg.time   i            s            0            g (0)        g (0) + s

seg.start  i            i            i            g (i)        g (i)

buffer.    T            T            T            g (T)        g (T)
timestamp
------------------------------------------------------------------------
stream     T            T            T            g (T)        g (T)
time                    - i          - i          - g (i)      - g (i)
                        + s                       + g (0)      + g (0)
                                                               + s
------------------------------------------------------------------------
seek       T            T            T            g (T - i)    g (T - i)
time                    - i          - i                       + s
                        + s

We can see that after the nleoperation, the seek time and stream time will generally be out of sync.

Note that if g corresponds to a constant rate effect, then

g (t) = r * t

for some rate r. Then, at the nleoperation external source pad.

stream-time = r * T - r * i + r * 0 + s
            = g (T - i) + s
            = seek-time

so the two will match up under the current behaviour for this special case. However, if the rate varies, this will break down.

If, instead, we translate seg.start and buffer.timestamp in the same way as seg.time on the nleobject pads, by adding or subtracting (start - in-point), then we will always have

seg.start = seg.time

which means that we would also have

seek-time = stream-time = buffer.timestamp

Finally, it would be a good if the convention for a time effect was to use the output stream time in gst_object_sync_values, rather than the input stream time. This would make them compatible with GES's rule that control sources are given in the internal source coordinates. Luckily, it seems that pitch already uses the output stream time. videorate doesn't currently use gst_object_sync_values.