mirror of
https://gitlab.freedesktop.org/gstreamer/gstreamer.git
synced 2025-01-19 05:45:58 +00:00
533 lines
25 KiB
Markdown
# Subtitle Overlays and Hardware-Accelerated Playback

This document describes some of the considerations and requirements that
led to the current `GstVideoOverlayCompositionMeta` API, which allows
subtitle bitmaps or logos to be attached to video buffers.

## Background

Subtitles can be muxed in containers or come from an external source.

Subtitles come in many shapes and colours. Usually they are either
text-based (incl. 'pango markup') or bitmap-based (e.g. DVD subtitles
and the most common form of DVB subs). Bitmap-based subtitles are
usually compressed in some way, e.g. with some form of run-length
encoding.

Subtitles are currently decoded and rendered in subtitle-format-specific
overlay elements. These elements have two sink pads (one for raw video
and one for the subtitle format in question) and one raw video source
pad.

They take care of synchronising the two input streams, and of
decoding and rendering the subtitles on top of the raw video stream.

Digression: one could theoretically have dedicated decoder/render
elements that output an AYUV or ARGB image, and then let a videomixer
element do the actual overlaying, but this is not very efficient,
because it requires us to allocate and blend whole pictures (1920x1080
AYUV = 8MB, 1280x720 AYUV = 3.6MB, 720x576 AYUV = 1.6MB) even if the
overlay region is only a small rectangle at the bottom. This wastes
memory and CPU. We could do something better by introducing a new format
that only encodes the region(s) of interest, but we don't have such a
format yet, and are not necessarily keen to rewrite this part of the
logic in playbin at this point - and we can't change existing elements'
behaviour, so would need to introduce new elements for this.

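
The sizes quoted in the digression follow from AYUV using four bytes per
pixel. A minimal sketch of that arithmetic is below; the helper name
`ayuv_bytes` is made up for illustration and is not a GStreamer function.

```c
#include <stddef.h>

/* AYUV stores four 8-bit components (A, Y, U, V) per pixel, so an
 * unpadded image needs width * height * 4 bytes.
 * Hypothetical helper, not part of any GStreamer API. */
static size_t
ayuv_bytes (size_t width, size_t height)
{
  return width * height * 4;
}
```

For example, `ayuv_bytes (1920, 1080)` is 8294400 bytes (the "8MB" above),
whereas a hypothetical 1920x120 subtitle strip at the bottom of the frame
would only need `ayuv_bytes (1920, 120)`, i.e. 921600 bytes.
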
Playbin supports outputting compressed formats, i.e. it does not force
decoding to a raw format, but is happy to output to a non-raw format as
long as the sink supports that as well.

In case of certain hardware-accelerated decoding APIs, we will make use
of that functionality. However, the decoder will not output a raw video
format then, but some kind of hardware/API-specific format (in the caps)
and the buffers will reference hardware/API-specific objects that the
hardware/API-specific sink will know how to handle.

## The Problem

In the case of such hardware-accelerated decoding, the decoder will not
output raw pixels that can easily be manipulated. Instead, it will
output hardware/API-specific objects that can later be used to render a
frame using the same API.

Even if we could transform such a buffer into raw pixels, we would most
likely want to avoid that, so that the data does not have to be mapped
back into system memory (and then later back to the GPU). It's much
better to upload the much smaller encoded data to the GPU/DSP and then
leave it there until rendered.

Before `GstVideoOverlayComposition`, playbin only supported subtitles on
top of raw decoded video. It would try to find a suitable overlay element
from the plugin registry based on the input subtitle caps and the rank.
(It is assumed that we will be able to convert any raw video format into
any format required by the overlay using a converter such as videoconvert.)

It would not render subtitles if the video sent to the sink was not raw
YUV or RGB, or if conversions had been disabled by setting the
native-video flag on playbin.

Subtitle rendering is considered an important feature. Enabling
hardware-accelerated decoding by default should not lead to a major
feature regression in this area.

This means that we need to support subtitle rendering on top of non-raw
video.

## Possible Solutions

The goal is to keep knowledge of the subtitle format within the
format-specific GStreamer plugins, and knowledge of any specific video
acceleration API within the GStreamer plugins implementing that API. We do
not want to make the pango/dvbsuboverlay/dvdspu/kate plugins link to
libva/libvdpau/etc. and we do not want to make the vaapi/vdpau plugins
link to all of libpango/libkate/libass etc.

Multiple possible solutions come to mind:

1) backend-specific overlay elements

    e.g. vaapitextoverlay, vdpautextoverlay, vaapidvdspu, vdpaudvdspu,
    vaapidvbsuboverlay, vdpaudvbsuboverlay, etc.

    This assumes the overlay can be done directly on the
    backend-specific object passed around.

    The main drawback with this solution is that it leads to a lot of
    code duplication and may also lead to uncertainty about distributing
    certain duplicated pieces of code. The code duplication is pretty
    much unavoidable, since making textoverlay, dvbsuboverlay, dvdspu,
    kate, assrender, etc. available in form of base classes to derive
    from is not really an option. Similarly, one would not really want
    the vaapi/vdpau plugin to depend on a bunch of other libraries such
    as libpango, libkate, libtiger, libass, etc.

    One could add some new kind of overlay plugin feature though, in
    combination with a generic base class of some sort, but in order to
    accommodate all the different cases and formats one would end up
    with quite a convoluted/tricky API.

    (Of course there could also be a `GstFancyVideoBuffer` that provides
    an abstraction for such video accelerated objects and that could
    provide an API to add overlays to it in a generic way, but in the
    end this is just a less generic variant of (3), and it is not clear
    that there are real benefits to a specialised solution vs. a more
    generic one).

2) convert backend-specific object to raw pixels and then overlay

    Even where possible technically, this is most likely very
    inefficient.

3) attach the overlay data to the backend-specific video frame buffers
   in a generic way and do the actual overlaying/blitting later in
   backend-specific code such as the video sink (or an accelerated
   encoder/transcoder)

    In this case, the actual overlay rendering (i.e. the actual text
    rendering or decoding DVD/DVB data into pixels) is done in the
    subtitle-format-specific GStreamer plugin. All knowledge about the
    subtitle format is contained in the overlay plugin then, and all
    knowledge about the video backend in the video backend specific
    plugin.

    The main question then is how to get the overlay pixels (and we will
    only deal with pixels here) from the overlay element to the video
    sink.

    This could be done in multiple ways: One could send custom events
    downstream with the overlay data, or one could attach the overlay
    data directly to the video buffers in some way.

    Sending inline events has the advantage that it is fairly
    transparent to any elements between the overlay element and the
    video sink: if an effects plugin creates a new video buffer for the
    output, nothing special needs to be done to maintain the subtitle
    overlay information, since the overlay data is not attached to the
    buffer. However, it slightly complicates things at the sink, since
    it would also need to look for the new event in question instead of
    just processing everything in its buffer render function.

    If one attaches the overlay data to the buffer directly, any element
    between overlay and video sink that creates a new video buffer would
    need to be aware of the overlay data attached to it and copy it over
    to the newly-created buffer.

    One would have to implement a special kind of new query (e.g.
    FEATURE query) that is not passed on automatically by
    `gst_pad_query_default()` in order to make sure that all elements
    downstream will handle the attached overlay data. (This is only a
    problem if we also want to attach overlay data to raw video pixel
    buffers; for new non-raw types we can just make support mandatory
    and be done with it; for existing non-raw types nothing changes
    anyway if subtitles don't work. We need to maintain backwards
    compatibility for existing raw video pipelines such as
    `..decoder ! suboverlay ! encoder..`.)

    Even though slightly more work, attaching the overlay information to
    buffers seems more intuitive than sending it interleaved as events.
    And buffers stored or passed around (e.g. via the "last-buffer"
    property in the sink when doing screenshots via playbin) always
    contain all the information needed.

4) create a video/x-raw-\*-delta format and use a backend-specific
   videomixer

    This possibility was hinted at already in the digression in the
    Background section. It would satisfy the goal of keeping subtitle
    format knowledge in the subtitle plugins and video backend knowledge
    in the video backend plugin. It would also add a concept that might
    be generally useful (think ximagesrc capture with xdamage). However,
    it would require adding foorender variants of all the existing
    overlay elements, and changing playbin to that new design, which is
    somewhat intrusive. And given the general nature of such a new
    format/API, we would need to take a lot of care to be able to
    accommodate all possible use cases when designing the API, which
    makes it considerably more ambitious. Lastly, we would need to write
    videomixer variants for the various accelerated video backends as
    well.

Overall (3) appears to be the most promising solution. It is the least
intrusive and should be fairly straightforward to implement with
reasonable effort, requiring only small changes to existing elements and
requiring no new elements.

Doing the final overlaying in the sink as opposed to a videomixer or
overlay in the middle of the pipeline has other advantages:

- if video frames need to be dropped, e.g. for QoS reasons, we could
  also skip the actual subtitle overlaying and possibly the
  decoding/rendering as well, if the implementation and API allows for
  that to be delayed.

- the sink often knows the actual size of the window/surface/screen
  the output video is rendered to. This *may* make it possible to
  render the overlay image in a higher resolution than the input
  video, solving a long-standing issue with pixelated subtitles on top
  of low-resolution videos that are then scaled up in the sink. This
  would of course require the rendering to be delayed instead of
  just attaching an AYUV/ARGB/RGBA blob of pixels to the video buffer
  in the overlay, but that could all be supported.

- if the video backend / sink has support for high-quality text
  rendering (clutter?) we could just pass the text or pango markup to
  the sink and let it do the rest (this is unlikely to be supported in
  the general case - text and glyph rendering is hard; also, we don't
  really want to make up our own text markup system, and pango markup
  is probably too limited for complex karaoke stuff).

## API needed

1) Representation of subtitle overlays to be rendered

    We need to pass the overlay pixels from the overlay element to the
    sink somehow. Whatever the exact mechanism, let's assume we pass a
    refcounted `GstVideoOverlayComposition` struct or object.

    A composition is made up of one or more overlays/rectangles.

    In the simplest case an overlay rectangle is just a blob of
    RGBA/ABGR \[FIXME?\] or AYUV pixels with positioning info and other
    metadata, and there is only one rectangle to render.

    We're keeping the naming generic ("OverlayFoo" rather than
    "SubtitleFoo") here, since this might also be handy for other use
    cases such as e.g. logo overlays or so. It is not designed for
    full-fledged video stream mixing though.

```
// Note: don't mind the exact implementation details, they'll be hidden

// FIXME: might be confusing in 0.11 though since GstXOverlay was
//        renamed to GstVideoOverlay in 0.11, but not much we can do,
//        maybe we can rename GstVideoOverlay to something better

struct GstVideoOverlayComposition
{
    guint                       num_rectangles;
    GstVideoOverlayRectangle ** rectangles;

    /* lowest rectangle sequence number still used by the upstream
     * overlay element. This way a renderer maintaining some kind of
     * rectangles <-> surface cache can know when to free cached
     * surfaces/rectangles. */
    guint                       min_seq_num_used;

    /* sequence number for the composition (same series as rectangles) */
    guint                       seq_num;
};

struct GstVideoOverlayRectangle
{
    /* Position on video frame and dimension of output rectangle in
     * output frame terms (already adjusted for the PAR of the output
     * frame). x/y can be negative (overlay will be clipped then) */
    gint  x, y;
    guint render_width, render_height;

    /* Dimensions of overlay pixels */
    guint width, height, stride;

    /* This is the PAR of the overlay pixels */
    guint par_n, par_d;

    /* Format of pixels, GST_VIDEO_FORMAT_ARGB on big-endian systems,
     * and BGRA on little-endian systems (i.e. pixels are treated as
     * 32-bit values and alpha is always in the most-significant byte,
     * and blue is in the least-significant byte).
     *
     * FIXME: does anyone actually use AYUV in practice? (we do
     * in our utility function to blend on top of raw video)
     * What about AYUV and endianness? Do we always have [A][Y][U][V]
     * in memory? */
    /* FIXME: maybe use our own enum? */
    GstVideoFormat format;

    /* Refcounted blob of memory, no caps or timestamps */
    GstBuffer *pixels;

    // FIXME: how to express source like text or pango markup?
    //        (just add source type enum + source buffer with data)
    //
    // FOR 0.10: always send pixel blobs, but attach source data in
    // addition (reason: if downstream changes, we can't renegotiate
    // that properly, if we just do a query of supported formats from
    // the start). Sink will just ignore pixels and use pango markup
    // from source data if it supports that.
    //
    // FOR 0.11: overlay should query formats (pango markup, pixels)
    // supported by downstream and then only send that. We can
    // renegotiate via the reconfigure event.
    //

    /* sequence number: useful for backends/renderers/sinks that want
     * to maintain a cache of rectangles <-> surfaces. The value of
     * the min_seq_num_used in the composition tells the renderer which
     * rectangles have expired. */
    guint      seq_num;

    /* FIXME: we also need a (private) way to cache converted/scaled
     * pixel blobs */
};

/* (a1) Overlay consumer API:
 *
 * How would this work in a video sink that supports scaling of
 * textures (pseudocode): */

gst_foo_sink_render () {
    /* assume only one for now */
    if video_buffer has composition:
      composition = video_buffer.get_composition()

    for each rectangle in composition:
      if rectangle.source_data_type == PANGO_MARKUP
        actor = text_from_pango_markup (rectangle.get_source_data())
      else
        pixels = rectangle.get_pixels_unscaled (FORMAT_RGBA, ...)
        actor = texture_from_rgba (pixels, ...)

      .. position + scale on top of video surface ...
}

/* (a2) Overlay producer API:
 *
 * e.g. logo or subpicture overlay: got pixels, stuff into rectangle
 * (pseudocode): */

if (logoverlay->cached_composition == NULL) {
  comp = composition_new ();

  rect = rectangle_new (format, pixels_buf,
      width, height, stride, par_n, par_d,
      x, y, render_width, render_height);

  /* composition adds its own ref for the rectangle */
  composition_add_rectangle (comp, rect);
  rectangle_unref (rect);

  /* buffer adds its own ref for the composition */
  video_buffer_attach_composition (comp);

  /* we take ownership of the composition and save it for later */
  logoverlay->cached_composition = comp;
} else {
  video_buffer_attach_composition (logoverlay->cached_composition);
}
```

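
The rectangle above carries both the PAR of the overlay pixels
(`par_n`/`par_d`) and render dimensions that are already adjusted for the
PAR of the output frame. One plausible way to derive such a render width
is sketched below. This is only an illustration under that assumption;
the function name and signature are hypothetical and not taken from the
actual API.

```c
#include <stdint.h>

/* Width, in video-frame pixels, that an overlay of `width` pixels with
 * pixel aspect ratio overlay_par_n/overlay_par_d should occupy on a
 * video frame with pixel aspect ratio video_par_n/video_par_d.
 * The height is left unchanged. Hypothetical helper for illustration. */
static unsigned int
par_adjusted_width (unsigned int width,
    unsigned int overlay_par_n, unsigned int overlay_par_d,
    unsigned int video_par_n, unsigned int video_par_d)
{
  uint64_t num = (uint64_t) width * overlay_par_n * video_par_d;
  uint64_t den = (uint64_t) overlay_par_d * video_par_n;

  if (den == 0)
    return 0;                   /* invalid PAR, nothing sensible to do */

  /* round to the nearest whole pixel */
  return (unsigned int) ((num + den / 2) / den);
}
```

With square overlay pixels (PAR 1/1) and a video PAR of 2/1, a 640 pixel
wide overlay maps to 320 video pixels, since each video pixel is displayed
twice as wide.
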
FIXME: also add some API to modify render position/dimensions of a
rectangle (probably requires creation of new rectangle, unless we
handle writability like with other mini objects).

2) Fallback overlay rendering/blitting on top of raw video

    Eventually we want to use this overlay mechanism not only for
    hardware-accelerated video, but also for plain old raw video, either
    at the sink or in the overlay element directly.

    Apart from the advantages of sink-side overlaying listed earlier
    under Possible Solutions, this allows us to consolidate a lot of
    overlaying/blitting code that is currently repeated in every single
    overlay element in one location. This makes it considerably easier
    to support a whole range of raw video formats out of the box, add
    SIMD-optimised rendering using ORC, or handle corner cases
    correctly.

    (Note: side-effect of overlaying raw video at the video sink is that
    if e.g. a screenshotter gets the last buffer via the last-buffer
    property of basesink, it would get an image without the subtitles on
    top. This could probably be fixed by re-implementing the property in
    `GstVideoSink` though. Playbin2 could handle this internally as well).

```
void
gst_video_overlay_composition_blend (GstVideoOverlayComposition * comp,
                                     GstBuffer                  * video_buf)
{
  guint n;

  g_return_if_fail (gst_buffer_is_writable (video_buf));
  g_return_if_fail (GST_BUFFER_CAPS (video_buf) != NULL);

  ... parse video_buffer caps into BlendVideoFormatInfo ...

  for each rectangle in the composition: {

    if (gst_video_format_is_yuv (video_buf_format)) {
      overlay_format = FORMAT_AYUV;
    } else if (gst_video_format_is_rgb (video_buf_format)) {
      overlay_format = FORMAT_ARGB;
    } else {
      /* FIXME: grayscale? */
      return;
    }

    /* this will scale and convert AYUV<->ARGB if needed */
    pixels = rectangle_get_pixels_scaled (rectangle, overlay_format);

    ... clip output rectangle ...

    __do_blend (video_buf_format, video_buf->data,
                overlay_format, pixels->data,
                x, y, width, height, stride);

    gst_buffer_unref (pixels);
  }
}
```

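
The "clip output rectangle" and `__do_blend` steps in the outline above
could look roughly like the following for the ARGB-on-ARGB case. This is
a minimal sketch, not the actual implementation: it assumes 32-bit
native-endian pixels with alpha in the most-significant byte (as
described for the rectangle format), non-premultiplied alpha, strides
given in pixels, and no scaling. The names `blend_channel` and
`blend_argb32` are made up for illustration.

```c
#include <stdint.h>

/* Source-over blend of one 8-bit channel with non-premultiplied alpha,
 * rounded to nearest. */
static uint8_t
blend_channel (uint8_t src, uint8_t dst, uint8_t alpha)
{
  return (uint8_t) ((src * alpha + dst * (255 - alpha) + 127) / 255);
}

/* Blend an overlay onto a frame at position (x, y). x/y may be negative
 * or place the overlay partly (or fully) outside the frame; only the
 * visible part is blended. The frame's own alpha byte is kept as-is. */
static void
blend_argb32 (uint32_t * frame, int frame_w, int frame_h, int frame_stride,
    const uint32_t * ovl, int ovl_w, int ovl_h, int ovl_stride, int x, int y)
{
  /* clip: [x0, x1) x [y0, y1) is the visible part in overlay coordinates */
  int x0 = (x < 0) ? -x : 0;
  int y0 = (y < 0) ? -y : 0;
  int x1 = (x + ovl_w > frame_w) ? frame_w - x : ovl_w;
  int y1 = (y + ovl_h > frame_h) ? frame_h - y : ovl_h;
  int i, j;

  for (j = y0; j < y1; j++) {
    for (i = x0; i < x1; i++) {
      uint32_t s = ovl[j * ovl_stride + i];
      uint32_t *d = &frame[(y + j) * frame_stride + (x + i)];
      uint8_t a = (uint8_t) (s >> 24);
      uint32_t r = blend_channel ((s >> 16) & 0xff, (*d >> 16) & 0xff, a);
      uint32_t g = blend_channel ((s >> 8) & 0xff, (*d >> 8) & 0xff, a);
      uint32_t b = blend_channel (s & 0xff, *d & 0xff, a);

      *d = (*d & 0xff000000u) | (r << 16) | (g << 8) | b;
    }
  }
}
```

If the overlay lies entirely outside the frame, the clipped range is empty
and the loops simply do not run. A real implementation would additionally
have to handle YUV formats, subsampled planes and scaling, which is
exactly the kind of code this section proposes to consolidate in one
place.
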
3) Flatten all rectangles in a composition

    We cannot assume that the video backend API can handle any number of
    rectangle overlays; it's possible that it only supports one single
    overlay, in which case we need to squash all rectangles into one.

    However, we'll just declare this a corner case for now, and
    implement it only if someone actually needs it. It's easy to add
    later API-wise. Might be a bit tricky if we have rectangles with
    different PARs/formats (e.g. subs and a logo), though we could
    probably always just use the blending code from (2) with a fully
    transparent video buffer to create a flattened overlay buffer.

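
If flattening were implemented along those lines, a first step would be
to work out how large the single flattened overlay has to be, i.e. the
bounding box of all render rectangles. A minimal sketch follows; the
`Rect` type and `bounding_box` function are hypothetical stand-ins, not
part of the proposed API.

```c
/* Hypothetical stand-in for the position and render size of a rectangle. */
typedef struct
{
  int x, y;
  unsigned int w, h;
} Rect;

/* Smallest rectangle that contains all n input rectangles.
 * Returns an empty rectangle at (0, 0) if n is 0. */
static Rect
bounding_box (const Rect * rects, unsigned int n)
{
  Rect out = { 0, 0, 0, 0 };
  int right, bottom;
  unsigned int i;

  if (n == 0)
    return out;

  out = rects[0];
  right = out.x + (int) out.w;
  bottom = out.y + (int) out.h;

  for (i = 1; i < n; i++) {
    int r = rects[i].x + (int) rects[i].w;
    int b = rects[i].y + (int) rects[i].h;

    if (rects[i].x < out.x)
      out.x = rects[i].x;
    if (rects[i].y < out.y)
      out.y = rects[i].y;
    if (r > right)
      right = r;
    if (b > bottom)
      bottom = b;
  }

  out.w = (unsigned int) (right - out.x);
  out.h = (unsigned int) (bottom - out.y);
  return out;
}
```

The flattened buffer would then be a fully transparent image of that size
onto which each rectangle is blended at its position relative to the
bounding box origin.
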
4) query support for the new video composition mechanism

    This is handled via `GstMeta` and an ALLOCATION query - we can simply
    query whether downstream supports the `GstVideoOverlayComposition` meta.

    Downstream possibly not being linked yet at the time when an overlay
    would want to do such a query does not appear to be a problem: we
    would just have to default to something and update ourselves later
    on a reconfigure event.

Other considerations:

- renderers (overlays or sinks) may be able to handle only ARGB or
  only AYUV (for most graphics/hw-API it's likely ARGB of some sort,
  while our blending utility functions will likely want the same
  colour space as the underlying raw video format, which is usually
  YUV of some sort). We need to convert where required, and should
  cache the conversion.

- renderers may or may not be able to scale the overlay. We need to do
  the scaling internally if not (simple case: just horizontal scaling
  to adjust for PAR differences; complex case: both horizontal and
  vertical scaling, e.g. if subs come from a different source than the
  video or the video has been rescaled or cropped between overlay
  element and sink).

- renderers may be able to generate (possibly scaled) pixels on demand
  from the original data (e.g. a string or RLE-encoded data). We will
  ignore this for now, since this functionality can still be added
  later via API additions. The most interesting case would be to pass
  a pango markup string, since e.g. clutter can handle that natively.

- renderers may be able to write data directly on top of the video
  pixels (instead of creating an intermediary buffer with the overlay
  which is then blended on top of the actual video frame), e.g.
  dvdspu, dvbsuboverlay

  However, in the interest of simplicity, we should probably ignore the
  fact that some elements can blend their overlays directly on top of the
  video (decoding/uncompressing them on the fly), even more so as it's not
  obvious that it's actually faster to decode the same overlay 70-90 times
  (say) (i.e. ca. 3 seconds of video frames) and then blend it 70-90 times
  instead of decoding it once into a temporary buffer and then blending it
  directly from there, possibly SIMD-accelerated. Also, this is only
  relevant if the video is raw video and not some hardware-acceleration
  backend object.

  And ultimately it is the overlay element that decides whether to do the
  overlay right there and then or have the sink do it (if supported). It
  could decide to keep doing the overlay itself for raw video and only use
  our new API for non-raw video.

- renderers may want to make sure they only upload the overlay pixels
  once per rectangle if that rectangle recurs in subsequent frames (as
  part of the same composition or a different composition), as is
  likely. This caching of e.g. surfaces needs to be done renderer-side
  and can be accomplished based on the sequence numbers. The
  composition contains the lowest sequence number still in use
  upstream (an overlay element may want to cache created
  compositions+rectangles as well after all to re-use them for
  multiple frames); based on that, the renderer can expire cached
  objects. The caching needs to be done renderer-side because
  attaching renderer-specific objects to the rectangles won't work
  well given the refcounted nature of rectangles and compositions,
  making it unpredictable when a rectangle or composition will be
  freed or from which thread context it will be freed. The
  renderer-specific objects are likely bound to other types of
  renderer-specific contexts, and need to be managed in connection
  with those.

- composition/rectangles should internally provide a certain degree of
  thread-safety. Multiple elements (sinks, overlay element) might
  access or use the same objects from multiple threads at the same
  time, and it is expected that elements will keep a ref to
  compositions and rectangles they push downstream for a while, e.g.
  until the current subtitle composition expires.

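
The renderer-side caching based on sequence numbers described above could
be sketched as follows. This is only an illustration under simplifying
assumptions: a small fixed-size table, an opaque `surface` pointer
standing in for the renderer-specific object, and no handling of sequence
number wrap-around. `CacheEntry`, `cache_lookup` and `cache_expire` are
hypothetical names.

```c
#include <stddef.h>

/* One cached renderer-side object, keyed by the rectangle's seq_num. */
typedef struct
{
  unsigned int seq_num;         /* sequence number of the cached rectangle */
  int in_use;                   /* non-zero if this slot holds a surface   */
  void *surface;                /* renderer-specific object (opaque here)  */
} CacheEntry;

/* Find the cached entry for a rectangle, or NULL if it needs uploading. */
static CacheEntry *
cache_lookup (CacheEntry * cache, size_t n, unsigned int seq_num)
{
  size_t i;

  for (i = 0; i < n; i++) {
    if (cache[i].in_use && cache[i].seq_num == seq_num)
      return &cache[i];
  }
  return NULL;
}

/* Expire every entry whose rectangle is older than the lowest sequence
 * number still used upstream (min_seq_num_used from the composition).
 * Returns the number of expired entries. A real renderer would release
 * the surface here, from its own thread/context. */
static unsigned int
cache_expire (CacheEntry * cache, size_t n, unsigned int min_seq_num_used)
{
  unsigned int expired = 0;
  size_t i;

  for (i = 0; i < n; i++) {
    if (cache[i].in_use && cache[i].seq_num < min_seq_num_used) {
      cache[i].in_use = 0;
      cache[i].surface = NULL;
      expired++;
    }
  }
  return expired;
}
```

A sink would call `cache_expire` with the `min_seq_num_used` of each
incoming composition and `cache_lookup` for each rectangle, uploading
pixels only on a miss. Because the cache lives in the renderer, cached
surfaces are released in the renderer's own context rather than whenever
the last reference to a rectangle happens to be dropped.
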
## Future considerations

- alternatives: there may be multiple versions/variants of the same
  subtitle stream. On DVDs, there may be a 4:3 version and a 16:9
  version of the same subtitles. We could attach both variants and let
  the renderer pick the best one for the situation (currently we just
  use the 16:9 version). With totem, it's ultimately totem that adds
  the 'black bars' at the top/bottom, so totem also knows if it's got
  a 4:3 display and can/wants to fit 4:3 subs (which may render on top
  of the bars) or not, for example.

## Misc. FIXMEs

TEST: should these look (roughly) alike (note text distortion) - needs
fixing in textoverlay

```
gst-launch-1.0 \
    videotestsrc ! video/x-raw,width=640,height=480,pixel-aspect-ratio=1/1 \
        ! textoverlay text=Hello font-desc=72 ! xvimagesink \
    videotestsrc ! video/x-raw,width=320,height=480,pixel-aspect-ratio=2/1 \
        ! textoverlay text=Hello font-desc=72 ! xvimagesink \
    videotestsrc ! video/x-raw,width=640,height=240,pixel-aspect-ratio=1/2 \
        ! textoverlay text=Hello font-desc=72 ! xvimagesink
```