gstreamer/subprojects/gst-docs/markdown/additional/design/subtitle-overlays.md

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

534 lines
25 KiB
Markdown
Raw Normal View History

# Subtitle Overlays and Hardware-Accelerated Playback
This document describes some of the considerations and requirements that
led to the current `GstVideoOverlayCompositionMeta` API which allows
attaching of subtitle bitmaps or logos to video buffers.
## Background
Subtitles can be muxed in containers or come from an external source.
Subtitles come in many shapes and colours. Usually they are either
text-based (incl. 'pango markup'), or bitmap-based (e.g. DVD subtitles
and the most common form of DVB subs). Bitmap based subtitles are
usually compressed in some way, like some form of run-length encoding.
Subtitles are currently decoded and rendered in subtitle-format-specific
overlay elements. These elements have two sink pads (one for raw video
and one for the subtitle format in question) and one raw video source
pad.
They will take care of synchronising the two input streams, and of
decoding and rendering the subtitles on top of the raw video stream.
Digression: one could theoretically have dedicated decoder/render
elements that output an AYUV or ARGB image, and then let a videomixer
element do the actual overlaying, but this is not very efficient,
because it requires us to allocate and blend whole pictures (1920x1080
AYUV = 8MB, 1280x720 AYUV = 3.6MB, 720x576 AYUV = 1.6MB) even if the
overlay region is only a small rectangle at the bottom. This wastes
memory and CPU. We could do something better by introducing a new format
that only encodes the region(s) of interest, but we don't have such a
format yet, and are not necessarily keen to rewrite this part of the
logic in playbin at this point - and we can't change existing elements'
behaviour, so would need to introduce new elements for this.
Playbin supports outputting compressed formats, i.e. it does not force
decoding to a raw format, but is happy to output to a non-raw format as
long as the sink supports that as well.
In case of certain hardware-accelerated decoding APIs, we will make use
of that functionality. However, the decoder will not output a raw video
format then, but some kind of hardware/API-specific format (in the caps)
and the buffers will reference hardware/API-specific objects that the
hardware/API-specific sink will know how to handle.
## The Problem
In the case of such hardware-accelerated decoding, the decoder will not
output raw pixels that can easily be manipulated. Instead, it will
output hardware/API-specific objects that can later be used to render a
frame using the same API.
Even if we could transform such a buffer into raw pixels, we most likely
would want to avoid that, in order to avoid the need to map the data
back into system memory (and then later back to the GPU). It's much
better to upload the much smaller encoded data to the GPU/DSP and then
leave it there until rendered.
Before `GstVideoOverlayComposition` playbin only supported subtitles on
top of raw decoded video. It would try to find a suitable overlay element
from the plugin registry based on the input subtitle caps and the rank.
(It is assumed that we will be able to convert any raw video format into
any format required by the overlay using a converter such as videoconvert.)
It would not render subtitles if the video sent to the sink is not raw
YUV or RGB or if conversions had been disabled by setting the
native-video flag on playbin.
Subtitle rendering is considered an important feature. Enabling
hardware-accelerated decoding by default should not lead to a major
feature regression in this area.
This means that we need to support subtitle rendering on top of non-raw
video.
## Possible Solutions
The goal is to keep knowledge of the subtitle format within the
format-specific GStreamer plugins, and knowledge of any specific video
acceleration API to the GStreamer plugins implementing that API. We do
not want to make the pango/dvbsuboverlay/dvdspu/kate plugins link to
libva/libvdpau/etc. and we do not want to make the vaapi/vdpau plugins
link to all of libpango/libkate/libass etc.
Multiple possible solutions come to mind:
1) backend-specific overlay elements
e.g. vaapitextoverlay, vdpautextoverlay, vaapidvdspu, vdpaudvdspu,
vaapidvbsuboverlay, vdpaudvbsuboverlay, etc.
This assumes the overlay can be done directly on the
backend-specific object passed around.
The main drawback with this solution is that it leads to a lot of
code duplication and may also lead to uncertainty about distributing
certain duplicated pieces of code. The code duplication is pretty
much unavoidable, since making textoverlay, dvbsuboverlay, dvdspu,
kate, assrender, etc. available in form of base classes to derive
from is not really an option. Similarly, one would not really want
the vaapi/vdpau plugin to depend on a bunch of other libraries such
as libpango, libkate, libtiger, libass, etc.
One could add some new kind of overlay plugin feature though in
combination with a generic base class of some sort, but in order to
accommodate all the different cases and formats one would end up
with quite convoluted/tricky API.
(Of course there could also be a `GstFancyVideoBuffer` that provides
an abstraction for such video accelerated objects and that could
provide an API to add overlays to it in a generic way, but in the
end this is just a less generic variant of (c), and it is not clear
that there are real benefits to a specialised solution vs. a more
generic one).
2) convert backend-specific object to raw pixels and then overlay
Even where possible technically, this is most likely very
inefficient.
3) attach the overlay data to the backend-specific video frame buffers
in a generic way and do the actual overlaying/blitting later in
backend-specific code such as the video sink (or an accelerated
encoder/transcoder)
In this case, the actual overlay rendering (i.e. the actual text
rendering or decoding DVD/DVB data into pixels) is done in the
subtitle-format-specific GStreamer plugin. All knowledge about the
subtitle format is contained in the overlay plugin then, and all
knowledge about the video backend in the video backend specific
plugin.
The main question then is how to get the overlay pixels (and we will
only deal with pixels here) from the overlay element to the video
sink.
This could be done in multiple ways: One could send custom events
downstream with the overlay data, or one could attach the overlay
data directly to the video buffers in some way.
Sending inline events has the advantage that is fairly
transparent to any elements between the overlay element and the
video sink: if an effects plugin creates a new video buffer for the
output, nothing special needs to be done to maintain the subtitle
overlay information, since the overlay data is not attached to the
buffer. However, it slightly complicates things at the sink, since
it would also need to look for the new event in question instead of
just processing everything in its buffer render function.
If one attaches the overlay data to the buffer directly, any element
between overlay and video sink that creates a new video buffer would
need to be aware of the overlay data attached to it and copy it over
to the newly-created buffer.
One would have to do implement a special kind of new query (e.g.
FEATURE query) that is not passed on automatically by
`gst_pad_query_default()` in order to make sure that all elements
downstream will handle the attached overlay data. (This is only a
problem if we want to also attach overlay data to raw video pixel
buffers; for new non-raw types we can just make it mandatory and
assume support and be done with it; for existing non-raw types
nothing changes anyway if subtitles don't work) (we need to maintain
backwards compatibility for existing raw video pipelines like e.g.:
`..decoder ! suboverlay ! encoder..`)
Even though slightly more work, attaching the overlay information to
buffers seems more intuitive than sending it interleaved as events.
And buffers stored or passed around (e.g. via the "last-buffer"
property in the sink when doing screenshots via playbin) always
contain all the information needed.
4) create a video/x-raw-delta format and use a backend-specific
videomixer
This possibility was hinted at already in the digression in section
1. It would satisfy the goal of keeping subtitle format knowledge in
the subtitle plugins and video backend knowledge in the video
backend plugin. It would also add a concept that might be generally
useful (think ximagesrc capture with xdamage). However, it would
require adding foorender variants of all the existing overlay
elements, and changing playbin to that new design, which is somewhat
intrusive. And given the general nature of such a new format/API, we
would need to take a lot of care to be able to accommodate all
possible use cases when designing the API, which makes it
considerably more ambitious. Lastly, we would need to write
videomixer variants for the various accelerated video backends as
well.
Overall (c) appears to be the most promising solution. It is the least
intrusive and should be fairly straight-forward to implement with
reasonable effort, requiring only small changes to existing elements and
requiring no new elements.
Doing the final overlaying in the sink as opposed to a videomixer or
overlay in the middle of the pipeline has other advantages:
- if video frames need to be dropped, e.g. for QoS reasons, we could
also skip the actual subtitle overlaying and possibly the
decoding/rendering as well, if the implementation and API allows for
that to be delayed.
- the sink often knows the actual size of the window/surface/screen
the output video is rendered to. This *may* make it possible to
render the overlay image in a higher resolution than the input
video, solving a long standing issue with pixelated subtitles on top
of low-resolution videos that are then scaled up in the sink. This
would require for the rendering to be delayed of course instead of
just attaching an AYUV/ARGB/RGBA blog of pixels to the video buffer
in the overlay, but that could all be supported.
- if the video backend / sink has support for high-quality text
rendering (clutter?) we could just pass the text or pango markup to
the sink and let it do the rest (this is unlikely to be supported in
the general case - text and glyph rendering is hard; also, we don't
really want to make up our own text markup system, and pango markup
is probably too limited for complex karaoke stuff).
## API needed
1) Representation of subtitle overlays to be rendered
We need to pass the overlay pixels from the overlay element to the
sink somehow. Whatever the exact mechanism, let's assume we pass a
refcounted `GstVideoOverlayComposition` struct or object.
A composition is made up of one or more overlays/rectangles.
In the simplest case an overlay rectangle is just a blob of
RGBA/ABGR \[FIXME?\] or AYUV pixels with positioning info and other
metadata, and there is only one rectangle to render.
We're keeping the naming generic ("OverlayFoo" rather than
"SubtitleFoo") here, since this might also be handy for other use
cases such as e.g. logo overlays or so. It is not designed for
full-fledged video stream mixing
though.
```
// Note: don't mind the exact implementation details, they'll be hidden
// FIXME: might be confusing in 0.11 though since GstXOverlay was
// renamed to GstVideoOverlay in 0.11, but not much we can do,
// maybe we can rename GstVideoOverlay to something better
struct GstVideoOverlayComposition
{
guint num_rectangles;
GstVideoOverlayRectangle ** rectangles;
/* lowest rectangle sequence number still used by the upstream
* overlay element. This way a renderer maintaining some kind of
* rectangles <-> surface cache can know when to free cached
* surfaces/rectangles. */
guint min_seq_num_used;
/* sequence number for the composition (same series as rectangles) */
guint seq_num;
}
struct GstVideoOverlayRectangle
{
/* Position on video frame and dimension of output rectangle in
* output frame terms (already adjusted for the PAR of the output
* frame). x/y can be negative (overlay will be clipped then) */
gint x, y;
guint render_width, render_height;
/* Dimensions of overlay pixels */
guint width, height, stride;
/* This is the PAR of the overlay pixels */
guint par_n, par_d;
/* Format of pixels, GST_VIDEO_FORMAT_ARGB on big-endian systems,
* and BGRA on little-endian systems (i.e. pixels are treated as
* 32-bit values and alpha is always in the most-significant byte,
* and blue is in the least-significant byte).
*
* FIXME: does anyone actually use AYUV in practice? (we do
* in our utility function to blend on top of raw video)
* What about AYUV and endianness? Do we always have [A][Y][U][V]
* in memory? */
/* FIXME: maybe use our own enum? */
GstVideoFormat format;
/* Refcounted blob of memory, no caps or timestamps */
GstBuffer *pixels;
// FIXME: how to express source like text or pango markup?
// (just add source type enum + source buffer with data)
//
// FOR 0.10: always send pixel blobs, but attach source data in
// addition (reason: if downstream changes, we can't renegotiate
// that properly, if we just do a query of supported formats from
// the start). Sink will just ignore pixels and use pango markup
// from source data if it supports that.
//
// FOR 0.11: overlay should query formats (pango markup, pixels)
// supported by downstream and then only send that. We can
// renegotiate via the reconfigure event.
//
/* sequence number: useful for backends/renderers/sinks that want
* to maintain a cache of rectangles <-> surfaces. The value of
* the min_seq_num_used in the composition tells the renderer which
* rectangles have expired. */
guint seq_num;
/* FIXME: we also need a (private) way to cache converted/scaled
* pixel blobs */
}
(a1) Overlay consumer
API:
How would this work in a video sink that supports scaling of textures:
gst_foo_sink_render () {
/* assume only one for now */
if video_buffer has composition:
composition = video_buffer.get_composition()
for each rectangle in composition:
if rectangle.source_data_type == PANGO_MARKUP
actor = text_from_pango_markup (rectangle.get_source_data())
else
pixels = rectangle.get_pixels_unscaled (FORMAT_RGBA, ...)
actor = texture_from_rgba (pixels, ...)
.. position + scale on top of video surface ...
}
(a2) Overlay producer
API:
e.g. logo or subpicture overlay: got pixels, stuff into rectangle:
if (logoverlay->cached_composition == NULL) {
comp = composition_new ();
rect = rectangle_new (format, pixels_buf,
width, height, stride, par_n, par_d,
x, y, render_width, render_height);
/* composition adds its own ref for the rectangle */
composition_add_rectangle (comp, rect);
rectangle_unref (rect);
/* buffer adds its own ref for the composition */
video_buffer_attach_composition (comp);
/* we take ownership of the composition and save it for later */
logoverlay->cached_composition = comp;
} else {
video_buffer_attach_composition (logoverlay->cached_composition);
}
```
FIXME: also add some API to modify render position/dimensions of a
rectangle (probably requires creation of new rectangle, unless we
handle writability like with other mini objects).
2) Fallback overlay rendering/blitting on top of raw video
Eventually we want to use this overlay mechanism not only for
hardware-accelerated video, but also for plain old raw video, either
at the sink or in the overlay element directly.
Apart from the advantages listed earlier in section 3, this allows
us to consolidate a lot of overlaying/blitting code that is
currently repeated in every single overlay element in one location.
This makes it considerably easier to support a whole range of raw
video formats out of the box, add SIMD-optimised rendering using
ORC, or handle corner cases correctly.
(Note: side-effect of overlaying raw video at the video sink is that
if e.g. a screnshotter gets the last buffer via the last-buffer
property of basesink, it would get an image without the subtitles on
top. This could probably be fixed by re-implementing the property in
`GstVideoSink` though. Playbin2 could handle this internally as well).
```
void
gst_video_overlay_composition_blend (GstVideoOverlayComposition * comp
GstBuffer * video_buf)
{
guint n;
g_return_if_fail (gst_buffer_is_writable (video_buf));
g_return_if_fail (GST_BUFFER_CAPS (video_buf) != NULL);
... parse video_buffer caps into BlendVideoFormatInfo ...
for each rectangle in the composition: {
if (gst_video_format_is_yuv (video_buf_format)) {
overlay_format = FORMAT_AYUV;
} else if (gst_video_format_is_rgb (video_buf_format)) {
overlay_format = FORMAT_ARGB;
} else {
/* FIXME: grayscale? */
return;
}
/* this will scale and convert AYUV<->ARGB if needed */
pixels = rectangle_get_pixels_scaled (rectangle, overlay_format);
... clip output rectangle ...
__do_blend (video_buf_format, video_buf->data,
overlay_format, pixels->data,
x, y, width, height, stride);
gst_buffer_unref (pixels);
}
}
```
3) Flatten all rectangles in a composition
We cannot assume that the video backend API can handle any number of
rectangle overlays, it's possible that it only supports one single
overlay, in which case we need to squash all rectangles into one.
However, we'll just declare this a corner case for now, and
implement it only if someone actually needs it. It's easy to add
later API-wise. Might be a bit tricky if we have rectangles with
different PARs/formats (e.g. subs and a logo), though we could
probably always just use the code from (b) with a fully transparent
video buffer to create a flattened overlay buffer.
4) query support for the new video composition mechanism
This is handled via `GstMeta` and an ALLOCATION query - we can simply
query whether downstream supports the `GstVideoOverlayComposition` meta.
There appears to be no issue with downstream possibly not being
linked yet at the time when an overlay would want to do such a
query, but we would just have to default to something and update
ourselves later on a reconfigure event then.
Other considerations:
- renderers (overlays or sinks) may be able to handle only ARGB or
only AYUV (for most graphics/hw-API it's likely ARGB of some sort,
while our blending utility functions will likely want the same
colour space as the underlying raw video format, which is usually
YUV of some sort). We need to convert where required, and should
cache the conversion.
- renderers may or may not be able to scale the overlay. We need to do
the scaling internally if not (simple case: just horizontal scaling
to adjust for PAR differences; complex case: both horizontal and
vertical scaling, e.g. if subs come from a different source than the
video or the video has been rescaled or cropped between overlay
element and sink).
- renderers may be able to generate (possibly scaled) pixels on demand
from the original data (e.g. a string or RLE-encoded data). We will
ignore this for now, since this functionality can still be added
later via API additions. The most interesting case would be to pass
a pango markup string, since e.g. clutter can handle that natively.
- renderers may be able to write data directly on top of the video
pixels (instead of creating an intermediary buffer with the overlay
which is then blended on top of the actual video frame), e.g.
dvdspu, dvbsuboverlay
However, in the interest of simplicity, we should probably ignore the
fact that some elements can blend their overlays directly on top of the
video (decoding/uncompressing them on the fly), even more so as it's not
obvious that it's actually faster to decode the same overlay 70-90 times
(say) (ie. ca. 3 seconds of video frames) and then blend it 70-90 times
instead of decoding it once into a temporary buffer and then blending it
directly from there, possibly SIMD-accelerated. Also, this is only
relevant if the video is raw video and not some hardware-acceleration
backend object.
And ultimately it is the overlay element that decides whether to do the
overlay right there and then or have the sink do it (if supported). It
could decide to keep doing the overlay itself for raw video and only use
our new API for non-raw video.
- renderers may want to make sure they only upload the overlay pixels
once per rectangle if that rectangle recurs in subsequent frames (as
part of the same composition or a different composition), as is
likely. This caching of e.g. surfaces needs to be done renderer-side
and can be accomplished based on the sequence numbers. The
composition contains the lowest sequence number still in use
upstream (an overlay element may want to cache created
compositions+rectangles as well after all to re-use them for
multiple frames), based on that the renderer can expire cached
objects. The caching needs to be done renderer-side because
attaching renderer-specific objects to the rectangles won't work
well given the refcounted nature of rectangles and compositions,
making it unpredictable when a rectangle or composition will be
freed or from which thread context it will be freed. The
renderer-specific objects are likely bound to other types of
renderer-specific contexts, and need to be managed in connection
with those.
- composition/rectangles should internally provide a certain degree of
thread-safety. Multiple elements (sinks, overlay element) might
access or use the same objects from multiple threads at the same
time, and it is expected that elements will keep a ref to
compositions and rectangles they push downstream for a while, e.g.
until the current subtitle composition expires.
## Future considerations
- alternatives: there may be multiple versions/variants of the same
subtitle stream. On DVDs, there may be a 4:3 version and a 16:9
version of the same subtitles. We could attach both variants and let
the renderer pick the best one for the situation (currently we just
use the 16:9 version). With totem, it's ultimately totem that adds
the 'black bars' at the top/bottom, so totem also knows if it's got
a 4:3 display and can/wants to fit 4:3 subs (which may render on top
of the bars) or not, for example.
## Misc. FIXMEs
TEST: should these look (roughly) alike (note text distortion) - needs
fixing in textoverlay
```
gst-launch-1.0 \
videotestsrc ! video/x-raw,width=640,height=480,pixel-aspect-ratio=1/1 \
! textoverlay text=Hello font-desc=72 ! xvimagesink \
videotestsrc ! video/x-raw,width=320,height=480,pixel-aspect-ratio=2/1 \
! textoverlay text=Hello font-desc=72 ! xvimagesink \
videotestsrc ! video/x-raw,width=640,height=240,pixel-aspect-ratio=1/2 \
! textoverlay text=Hello font-desc=72 ! xvimagesink
```