gstreamer/docs/design/draft-subtitle-overlays.txt
2012-10-01 11:58:36 +02:00


===============================================================
Subtitle overlays, hardware-accelerated decoding and playbin
===============================================================
Status: EARLY DRAFT / BRAINSTORMING
=== 1. Background ===
Subtitles can be muxed in containers or come from an external source.
Subtitles come in many shapes and colours. Usually they are either
text-based (incl. 'pango markup'), or bitmap-based (e.g. DVD subtitles
and the most common form of DVB subs). Bitmap-based subtitles are
usually compressed in some way, like some form of run-length encoding.
Subtitles are currently decoded and rendered in subtitle-format-specific
overlay elements. These elements have two sink pads (one for raw video
and one for the subtitle format in question) and one raw video source pad.
They will take care of synchronising the two input streams, and of
decoding and rendering the subtitles on top of the raw video stream.
Digression: one could theoretically have dedicated decoder/render elements
that output an AYUV or ARGB image, and then let a videomixer element do
the actual overlaying, but this is not very efficient, because it requires
us to allocate and blend whole pictures (1920x1080 AYUV = 8MB,
1280x720 AYUV = 3.6MB, 720x576 AYUV = 1.6MB) even if the overlay region
is only a small rectangle at the bottom. This wastes memory and CPU.
We could do something better by introducing a new format that only
encodes the region(s) of interest, but we don't have such a format yet, and
are not necessarily keen to rewrite this part of the logic in playbin
at this point - and we can't change existing elements' behaviour, so would
need to introduce new elements for this.
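To put rough numbers on the digression above, here is a minimal sketch of
the arithmetic, assuming a packed 4-bytes-per-pixel format such as AYUV or
ARGB with no row padding. The helper name packed32_image_size() is made up
for this illustration and is not existing API.

```c
#include <stddef.h>

/* Bytes needed for a packed 32-bit-per-pixel image (e.g. AYUV or ARGB),
 * assuming rows are not padded. Hypothetical helper, illustration only. */
size_t
packed32_image_size (size_t width, size_t height)
{
  return width * height * 4;
}
```

A full 1920x1080 frame comes to 8294400 bytes, whereas a 1920x120 strip
at the bottom of that frame (an arbitrary example region) would need
921600 bytes, i.e. one ninth of that.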
Playbin2 supports outputting compressed formats, i.e. it does not
force decoding to a raw format, but is happy to output to a non-raw
format as long as the sink supports that as well.
In case of certain hardware-accelerated decoding APIs, we will make use
of that functionality. However, the decoder will not output a raw video
format then, but some kind of hardware/API-specific format (in the caps)
and the buffers will reference hardware/API-specific objects that
the hardware/API-specific sink will know how to handle.
=== 2. The Problem ===
In the case of such hardware-accelerated decoding, the decoder will not
output raw pixels that can easily be manipulated. Instead, it will
output hardware/API-specific objects that can later be used to render
a frame using the same API.
Even if we could transform such a buffer into raw pixels, we most
likely would want to avoid that, in order to avoid the need to
map the data back into system memory (and then later back to the GPU).
It's much better to upload the much smaller encoded data to the GPU/DSP
and then leave it there until rendered.
Currently playbin only supports subtitles on top of raw decoded video.
It will try to find a suitable overlay element from the plugin registry
based on the input subtitle caps and the rank. (It is assumed that we
will be able to convert any raw video format into any format required
by the overlay using a converter such as videoconvert.)
It will not render subtitles if the video sent to the sink is not
raw YUV or RGB or if conversions have been disabled by setting the
native-video flag on playbin.
Subtitle rendering is considered an important feature. Enabling
hardware-accelerated decoding by default should not lead to a major
feature regression in this area.
This means that we need to support subtitle rendering on top of
non-raw video.
=== 3. Possible Solutions ===
The goal is to keep knowledge of the subtitle format within the
format-specific GStreamer plugins, and knowledge of any specific
video acceleration API to the GStreamer plugins implementing
that API. We do not want to make the pango/dvbsuboverlay/dvdspu/kate
plugins link to libva/libvdpau/etc. and we do not want to make
the vaapi/vdpau plugins link to all of libpango/libkate/libass etc.
Multiple possible solutions come to mind:
(a) backend-specific overlay elements
e.g. vaapitextoverlay, vdpautextoverlay, vaapidvdspu, vdpaudvdspu,
vaapidvbsuboverlay, vdpaudvbsuboverlay, etc.
This assumes the overlay can be done directly on the backend-specific
object passed around.
The main drawback with this solution is that it leads to a lot of
code duplication and may also lead to uncertainty about distributing
certain duplicated pieces of code. The code duplication is pretty
much unavoidable, since making textoverlay, dvbsuboverlay, dvdspu,
kate, assrender, etc. available in form of base classes to derive
from is not really an option. Similarly, one would not really want
the vaapi/vdpau plugin to depend on a bunch of other libraries
such as libpango, libkate, libtiger, libass, etc.
One could add some new kind of overlay plugin feature though in
combination with a generic base class of some sort, but in order
to accommodate all the different cases and formats one would end
up with quite convoluted/tricky API.
(Of course there could also be a GstFancyVideoBuffer that provides
an abstraction for such video accelerated objects and that could
provide an API to add overlays to it in a generic way, but in the
end this is just a less generic variant of (c), and it is not clear
that there are real benefits to a specialised solution vs. a more
generic one).
(b) convert backend-specific object to raw pixels and then overlay
Even where possible technically, this is most likely very
inefficient.
(c) attach the overlay data to the backend-specific video frame buffers
in a generic way and do the actual overlaying/blitting later in
backend-specific code such as the video sink (or an accelerated
encoder/transcoder)
In this case, the actual overlay rendering (i.e. the actual text
rendering or decoding DVD/DVB data into pixels) is done in the
subtitle-format-specific GStreamer plugin. All knowledge about
the subtitle format is contained in the overlay plugin then,
and all knowledge about the video backend in the video backend
specific plugin.
The main question then is how to get the overlay pixels (and
we will only deal with pixels here) from the overlay element
to the video sink.
This could be done in multiple ways: One could send custom
events downstream with the overlay data, or one could attach
the overlay data directly to the video buffers in some way.
Sending inline events has the advantage that it is fairly
transparent to any elements between the overlay element and
the video sink: if an effects plugin creates a new video
buffer for the output, nothing special needs to be done to
maintain the subtitle overlay information, since the overlay
data is not attached to the buffer. However, it slightly
complicates things at the sink, since it would also need to
look for the new event in question instead of just processing
everything in its buffer render function.
If one attaches the overlay data to the buffer directly, any
element between overlay and video sink that creates a new
video buffer would need to be aware of the overlay data
attached to it and copy it over to the newly-created buffer.
One would have to implement a special kind of new query
(e.g. FEATURE query) that is not passed on automatically by
gst_pad_query_default() in order to make sure that all elements
downstream will handle the attached overlay data. (This is only
a problem if we want to also attach overlay data to raw video
pixel buffers; for new non-raw types we can just make it
mandatory and assume support and be done with it; for existing
non-raw types nothing changes anyway if subtitles don't work.)
(We need to maintain backwards compatibility for existing raw
video pipelines like e.g.: ..decoder ! suboverlay ! encoder..)
Even though slightly more work, attaching the overlay information
to buffers seems more intuitive than sending it interleaved as
events. And buffers stored or passed around (e.g. via the
"last-buffer" property in the sink when doing screenshots via
playbin) always contain all the information needed.
(d) create a video/x-raw-*-delta format and use a backend-specific videomixer
This possibility was hinted at already in the digression in
section 1. It would satisfy the goal of keeping subtitle format
knowledge in the subtitle plugins and video backend knowledge
in the video backend plugin. It would also add a concept that
might be generally useful (think ximagesrc capture with xdamage).
However, it would require adding foorender variants of all the
existing overlay elements, and changing playbin to that new
design, which is somewhat intrusive. And given the general
nature of such a new format/API, we would need to take a lot
of care to be able to accommodate all possible use cases when
designing the API, which makes it considerably more ambitious.
Lastly, we would need to write videomixer variants for the
various accelerated video backends as well.
Overall (c) appears to be the most promising solution. It is the least
intrusive and should be fairly straight-forward to implement with
reasonable effort, requiring only small changes to existing elements
and requiring no new elements.
Doing the final overlaying in the sink as opposed to a videomixer
or overlay in the middle of the pipeline has other advantages:
- if video frames need to be dropped, e.g. for QoS reasons,
we could also skip the actual subtitle overlaying and
possibly the decoding/rendering as well, if the
implementation and API allows for that to be delayed.
- the sink often knows the actual size of the window/surface/screen
the output video is rendered to. This *may* make it possible to
render the overlay image in a higher resolution than the input
video, solving a long standing issue with pixelated subtitles on
top of low-resolution videos that are then scaled up in the sink.
This would of course require the rendering to be delayed, instead
of just attaching an AYUV/ARGB/RGBA blob of pixels to the video buffer
in the overlay, but that could all be supported.
- if the video backend / sink has support for high-quality text
rendering (clutter?) we could just pass the text or pango markup
to the sink and let it do the rest (this is unlikely to be
supported in the general case - text and glyph rendering is
hard; also, we don't really want to make up our own text markup
system, and pango markup is probably too limited for complex
karaoke stuff).
=== 4. API needed ===
(a) Representation of subtitle overlays to be rendered
We need to pass the overlay pixels from the overlay element to the
sink somehow. Whatever the exact mechanism, let's assume we pass
a refcounted GstVideoOverlayComposition struct or object.
A composition is made up of one or more overlays/rectangles.
In the simplest case an overlay rectangle is just a blob of
RGBA/ABGR [FIXME?] or AYUV pixels with positioning info and other
metadata, and there is only one rectangle to render.
We're keeping the naming generic ("OverlayFoo" rather than
"SubtitleFoo") here, since this might also be handy for
other use cases such as logo overlays. It is not
designed for full-fledged video stream mixing though.
// Note: don't mind the exact implementation details, they'll be hidden
// FIXME: might be confusing in 0.11 though since GstXOverlay was
// renamed to GstVideoOverlay in 0.11, but not much we can do,
// maybe we can rename GstVideoOverlay to something better
struct GstVideoOverlayComposition
{
guint num_rectangles;
GstVideoOverlayRectangle ** rectangles;
/* lowest rectangle sequence number still used by the upstream
* overlay element. This way a renderer maintaining some kind of
* rectangles <-> surface cache can know when to free cached
* surfaces/rectangles. */
guint min_seq_num_used;
/* sequence number for the composition (same series as rectangles) */
guint seq_num;
};
struct GstVideoOverlayRectangle
{
/* Position on video frame and dimension of output rectangle in
* output frame terms (already adjusted for the PAR of the output
* frame). x/y can be negative (overlay will be clipped then) */
gint x, y;
guint render_width, render_height;
/* Dimensions of overlay pixels */
guint width, height, stride;
/* This is the PAR of the overlay pixels */
guint par_n, par_d;
/* Format of pixels, GST_VIDEO_FORMAT_ARGB on big-endian systems,
* and BGRA on little-endian systems (i.e. pixels are treated as
* 32-bit values and alpha is always in the most-significant byte,
* and blue is in the least-significant byte).
*
* FIXME: does anyone actually use AYUV in practice? (we do
* in our utility function to blend on top of raw video)
* What about AYUV and endianness? Do we always have [A][Y][U][V]
* in memory? */
/* FIXME: maybe use our own enum? */
GstVideoFormat format;
/* Refcounted blob of memory, no caps or timestamps */
GstBuffer *pixels;
// FIXME: how to express source like text or pango markup?
// (just add source type enum + source buffer with data)
//
// FOR 0.10: always send pixel blobs, but attach source data in
// addition (reason: if downstream changes, we can't renegotiate
// that properly, if we just do a query of supported formats from
// the start). Sink will just ignore pixels and use pango markup
// from source data if it supports that.
//
// FOR 0.11: overlay should query formats (pango markup, pixels)
// supported by downstream and then only send that. We can
// renegotiate via the reconfigure event.
//
/* sequence number: useful for backends/renderers/sinks that want
* to maintain a cache of rectangles <-> surfaces. The value of
* the min_seq_num_used in the composition tells the renderer which
* rectangles have expired. */
guint seq_num;
/* FIXME: we also need a (private) way to cache converted/scaled
* pixel blobs */
};
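As an illustration of how render_width relates to the overlay pixel width
and the two pixel aspect ratios, here is a minimal sketch. It assumes that
only the PAR difference needs correcting (no additional rescaling of the
video), and the helper name overlay_render_width() and its out_par_n /
out_par_d parameters are hypothetical, not part of the proposed API.

```c
#include <stdint.h>

/* Width, in output-frame pixels, covered by an overlay that is `width`
 * pixels wide with pixel aspect ratio par_n/par_d, when shown on a frame
 * whose pixel aspect ratio is out_par_n/out_par_d. Rounds to nearest.
 * Returns 0 for invalid ratios. Inputs are assumed to be small enough
 * that the products fit into 64 bits. Hypothetical helper. */
unsigned int
overlay_render_width (unsigned int width,
    unsigned int par_n, unsigned int par_d,
    unsigned int out_par_n, unsigned int out_par_d)
{
  uint64_t num, den;

  if (par_d == 0 || out_par_n == 0)
    return 0;

  num = (uint64_t) width * par_n * out_par_d;
  den = (uint64_t) par_d * out_par_n;

  return (unsigned int) ((num + den / 2) / den);
}
```

For example, 720 overlay pixels with a PAR of 16/15 would cover 768 pixels
on a frame with square pixels, and 640 square overlay pixels would cover
320 pixels on a frame with a PAR of 2/1.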
(a1) Overlay consumer API:
How would this work in a video sink that supports scaling of textures:
gst_foo_sink_render () {
/* assume only one for now */
if video_buffer has composition:
composition = video_buffer.get_composition()
for each rectangle in composition:
if rectangle.source_data_type == PANGO_MARKUP
actor = text_from_pango_markup (rectangle.get_source_data())
else
pixels = rectangle.get_pixels_unscaled (FORMAT_RGBA, ...)
actor = texture_from_rgba (pixels, ...)
.. position + scale on top of video surface ...
}
(a2) Overlay producer API:
e.g. logo or subpicture overlay: got pixels, stuff into rectangle:
if (logoverlay->cached_composition == NULL) {
comp = composition_new ();
rect = rectangle_new (format, pixels_buf,
width, height, stride, par_n, par_d,
x, y, render_width, render_height);
/* composition adds its own ref for the rectangle */
composition_add_rectangle (comp, rect);
rectangle_unref (rect);
/* buffer adds its own ref for the composition */
video_buffer_attach_composition (comp);
/* we take ownership of the composition and save it for later */
logoverlay->cached_composition = comp;
} else {
video_buffer_attach_composition (logoverlay->cached_composition);
}
FIXME: also add some API to modify render position/dimensions of
a rectangle (probably requires creation of new rectangle, unless
we handle writability like with other mini objects).
(b) Fallback overlay rendering/blitting on top of raw video
Eventually we want to use this overlay mechanism not only for
hardware-accelerated video, but also for plain old raw video,
either at the sink or in the overlay element directly.
Apart from the advantages listed earlier in section 3, this
allows us to consolidate a lot of overlaying/blitting code that
is currently repeated in every single overlay element in one
location. This makes it considerably easier to support a whole
range of raw video formats out of the box, add SIMD-optimised
rendering using ORC, or handle corner cases correctly.
(Note: side-effect of overlaying raw video at the video sink is
that if e.g. a screenshotter gets the last buffer via the last-buffer
property of basesink, it would get an image without the subtitles
on top. This could probably be fixed by re-implementing the
property in GstVideoSink though. Playbin2 could handle this
internally as well).
void
gst_video_overlay_composition_blend (GstVideoOverlayComposition * comp,
GstBuffer * video_buf)
{
guint n;
g_return_if_fail (gst_buffer_is_writable (video_buf));
g_return_if_fail (GST_BUFFER_CAPS (video_buf) != NULL);
... parse video_buffer caps into BlendVideoFormatInfo ...
for each rectangle in the composition: {
if (gst_video_format_is_yuv (video_buf_format)) {
overlay_format = FORMAT_AYUV;
} else if (gst_video_format_is_rgb (video_buf_format)) {
overlay_format = FORMAT_ARGB;
} else {
/* FIXME: grayscale? */
return;
}
/* this will scale and convert AYUV<->ARGB if needed */
pixels = rectangle_get_pixels_scaled (rectangle, overlay_format);
... clip output rectangle ...
__do_blend (video_buf_format, video_buf->data,
overlay_format, pixels->data,
x, y, width, height, stride);
gst_buffer_unref (pixels);
}
}
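The clipping and blending steps elided in the pseudo-code above could look
roughly like the following sketch. It is only one possible shape for the
inner blend, under these assumptions: both overlay and video pixels are
native-endian 32-bit values with alpha in the most-significant byte and
blue in the least-significant byte, the overlay is not premultiplied and
has already been scaled to its render size, the video is opaque, and
strides are given in pixels. The function names are made up for the
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Blend one 8-bit channel of src over dst, alpha in [0,255]. */
static uint8_t
blend_channel (uint8_t src, uint8_t dst, uint8_t alpha)
{
  return (uint8_t) ((src * alpha + dst * (255 - alpha) + 127) / 255);
}

/* Blend a w x h overlay (0xAARRGGBB pixels) onto an opaque video frame
 * (0xFFRRGGBB pixels) at position x/y. x/y may be negative or reach past
 * the frame; the overlay is clipped to the frame in that case.
 * Hypothetical helper, illustration only. */
void
blend_argb_onto_xrgb (uint32_t * video, int frame_w, int frame_h,
    int video_stride_px, const uint32_t * overlay, int w, int h,
    int overlay_stride_px, int x, int y)
{
  /* visible region in video frame coordinates */
  int x0 = x < 0 ? 0 : x;
  int y0 = y < 0 ? 0 : y;
  int x1 = x + w > frame_w ? frame_w : x + w;
  int y1 = y + h > frame_h ? frame_h : y + h;
  int row, col;

  /* if the overlay is entirely outside the frame, the loops do nothing */
  for (row = y0; row < y1; row++) {
    const uint32_t *src = overlay + (size_t) (row - y) * overlay_stride_px;
    uint32_t *dst = video + (size_t) row * video_stride_px;

    for (col = x0; col < x1; col++) {
      uint32_t s = src[col - x];
      uint32_t d = dst[col];
      uint8_t a = (uint8_t) (s >> 24);
      uint8_t r = blend_channel ((s >> 16) & 0xff, (d >> 16) & 0xff, a);
      uint8_t g = blend_channel ((s >> 8) & 0xff, (d >> 8) & 0xff, a);
      uint8_t b = blend_channel (s & 0xff, d & 0xff, a);

      dst[col] = 0xff000000u | ((uint32_t) r << 16) | ((uint32_t) g << 8) | b;
    }
  }
}
```

A real implementation would dispatch on the video format (and likely use
ORC), but the clipping logic would stay the same for all formats.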
(c) Flatten all rectangles in a composition
We cannot assume that the video backend API can handle any
number of rectangle overlays, it's possible that it only
supports one single overlay, in which case we need to squash
all rectangles into one.
However, we'll just declare this a corner case for now, and
implement it only if someone actually needs it. It's easy
to add later API-wise. Might be a bit tricky if we have
rectangles with different PARs/formats (e.g. subs and a logo),
though we could probably always just use the code from (b)
with a fully transparent video buffer to create a flattened
overlay buffer.
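One small piece of such flattening would be working out how large the
single flattened overlay has to be, i.e. the bounding box of all render
rectangles. A minimal sketch follows, assuming all rectangles are already
expressed in output frame coordinates; the Rect type and the function
name are hypothetical.

```c
#include <stddef.h>

/* Render position and size of one rectangle (hypothetical type). */
typedef struct {
  int x, y;
  unsigned int w, h;
} Rect;

/* Computes the smallest rectangle containing all n rectangles.
 * Returns 0 on success, -1 if there are no rectangles. */
int
rects_bounding_box (const Rect * rects, size_t n, Rect * out)
{
  long long x0, y0, x1, y1;
  size_t i;

  if (n == 0)
    return -1;

  x0 = rects[0].x;
  y0 = rects[0].y;
  x1 = x0 + rects[0].w;
  y1 = y0 + rects[0].h;

  for (i = 1; i < n; i++) {
    long long rx1 = (long long) rects[i].x + rects[i].w;
    long long ry1 = (long long) rects[i].y + rects[i].h;

    if (rects[i].x < x0)
      x0 = rects[i].x;
    if (rects[i].y < y0)
      y0 = rects[i].y;
    if (rx1 > x1)
      x1 = rx1;
    if (ry1 > y1)
      y1 = ry1;
  }

  out->x = (int) x0;
  out->y = (int) y0;
  out->w = (unsigned int) (x1 - x0);
  out->h = (unsigned int) (y1 - y0);
  return 0;
}
```

The flattened buffer would then be a fully transparent image of that size
onto which each rectangle is blended at its position relative to the
bounding box origin.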
(d) core API: new FEATURE query
For 0.10 we need to add a FEATURE query, so the overlay element
can query whether the sink downstream and all elements between
the overlay element and the sink support the new overlay API.
Elements in between need to support it because the render
positions and dimensions need to be updated if the video is
cropped or rescaled, for example.
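To illustrate the kind of update an element in between would have to do,
here is a minimal sketch for a cropping and a scaling element. The
RenderRect type and both function names are made up for this example;
the real API for modifying render position/dimensions is still a FIXME
in section 4 (a2).

```c
/* Render position and size of one overlay rectangle, in video frame
 * coordinates (hypothetical type). */
typedef struct {
  int x, y;
  unsigned int render_width, render_height;
} RenderRect;

/* A crop element removed crop_left columns and crop_top rows: the
 * overlay moves by the same amount and may end up partly outside the
 * frame (negative x/y), to be clipped by the renderer later. */
void
render_rect_apply_crop (RenderRect * r, int crop_left, int crop_top)
{
  r->x -= crop_left;
  r->y -= crop_top;
}

/* A scale element changed the frame from in_w x in_h to out_w x out_h.
 * Positions and sizes are scaled by the same factors, truncating
 * towards zero. Returns 0 on success, -1 on invalid input size. */
int
render_rect_apply_scale (RenderRect * r, unsigned int in_w,
    unsigned int in_h, unsigned int out_w, unsigned int out_h)
{
  if (in_w == 0 || in_h == 0)
    return -1;

  r->x = (int) ((long long) r->x * out_w / in_w);
  r->y = (int) ((long long) r->y * out_h / in_h);
  r->render_width =
      (unsigned int) ((unsigned long long) r->render_width * out_w / in_w);
  r->render_height =
      (unsigned int) ((unsigned long long) r->render_height * out_h / in_h);
  return 0;
}
```

An element that neither crops nor scales would just pass the composition
through unchanged.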
In order to ensure that all elements support the new API,
we need to drop the query in the pad default query handler
(so it only succeeds if all elements handle it explicitly).
Might want two variants of the feature query - one where
all elements in the chain need to support it explicitly
and one where it's enough if some element downstream
supports it.
In 0.11 this could probably be handled via GstMeta and
ALLOCATION queries (and/or we could simply require
elements to be aware of this API from the start).
There appears to be no issue with downstream possibly
not being linked yet at the time when an overlay would
want to do such a query.
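The two query variants mentioned above can be modelled with a toy chain
of downstream elements. This is only a sketch of the intended semantics,
not GStreamer code: the Element type and both functions are hypothetical,
and a real implementation would go through the pad query handlers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for one element downstream of the overlay element; the last
 * entry in a chain is the sink (hypothetical type). */
typedef struct {
  const char *name;
  bool handles_overlay_query;
} Element;

/* Strict variant: the default query handler drops the query, so it only
 * succeeds if every element down to the sink handles it explicitly. */
bool
feature_query_strict (const Element * chain, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++) {
    if (!chain[i].handles_overlay_query)
      return false;
  }
  return n > 0;
}

/* Relaxed variant: it is enough if some element downstream supports
 * the overlay API. */
bool
feature_query_relaxed (const Element * chain, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++) {
    if (chain[i].handles_overlay_query)
      return true;
  }
  return false;
}
```

With a chain of a crop element that handles the query, a scaler that
does not, and a sink that does, the strict variant fails while the
relaxed one succeeds.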
Other considerations:
- renderers (overlays or sinks) may be able to handle only ARGB or only AYUV
(for most graphics/hw-API it's likely ARGB of some sort, while our
blending utility functions will likely want the same colour space as
the underlying raw video format, which is usually YUV of some sort).
We need to convert where required, and should cache the conversion.
- renderers may or may not be able to scale the overlay. We need to
do the scaling internally if not (simple case: just horizontal scaling
to adjust for PAR differences; complex case: both horizontal and vertical
scaling, e.g. if subs come from a different source than the video or the
video has been rescaled or cropped between overlay element and sink).
- renderers may be able to generate (possibly scaled) pixels on demand
from the original data (e.g. a string or RLE-encoded data). We will
ignore this for now, since this functionality can still be added later
via API additions. The most interesting case would be to pass a pango
markup string, since e.g. clutter can handle that natively.
- renderers may be able to write data directly on top of the video pixels
(instead of creating an intermediary buffer with the overlay which is
then blended on top of the actual video frame), e.g. dvdspu, dvbsuboverlay.
However, in the interest of simplicity, we should probably ignore the
fact that some elements can blend their overlays directly on top of the
video (decoding/uncompressing them on the fly), even more so as it's
not obvious that it's actually faster to decode the same overlay
70-90 times, say (i.e. ca. 3 seconds of video frames), and then blend
it 70-90 times instead of decoding it once into a temporary buffer
and then blending it directly from there, possibly SIMD-accelerated.
Also, this is only relevant if the video is raw video and not some
hardware-acceleration backend object.
And ultimately it is the overlay element that decides whether to do
the overlay right there and then or have the sink do it (if supported).
It could decide to keep doing the overlay itself for raw video and
only use our new API for non-raw video.
- renderers may want to make sure they only upload the overlay pixels once
per rectangle if that rectangle recurs in subsequent frames (as part of
the same composition or a different composition), as is likely. This caching
of e.g. surfaces needs to be done renderer-side and can be accomplished
based on the sequence numbers. The composition contains the lowest
sequence number still in use upstream (an overlay element may want to
cache created compositions+rectangles as well after all to re-use them
for multiple frames), based on that the renderer can expire cached
objects. The caching needs to be done renderer-side because attaching
renderer-specific objects to the rectangles won't work well given the
refcounted nature of rectangles and compositions, making it unpredictable
when a rectangle or composition will be freed or from which thread
context it will be freed. The renderer-specific objects are likely bound
to other types of renderer-specific contexts, and need to be managed
in connection with those.
- composition/rectangles should internally provide a certain degree of
thread-safety. Multiple elements (sinks, overlay element) might access
or use the same objects from multiple threads at the same time, and it
is expected that elements will keep a ref to compositions and rectangles
they push downstream for a while, e.g. until the current subtitle
composition expires.
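The renderer-side caching based on sequence numbers described above could
be sketched as follows. This is a minimal illustration under a few
assumptions: a small fixed-size cache, an integer surface_id standing in
for whatever renderer-specific object is cached, and no handling of
sequence number wrap-around. All type and function names are hypothetical.

```c
#include <stddef.h>

#define CACHE_SLOTS 8

/* One cached rectangle <-> surface mapping (hypothetical type). */
typedef struct {
  int in_use;
  unsigned int seq_num;         /* sequence number of the rectangle */
  int surface_id;               /* stand-in for a renderer-specific object */
} CacheEntry;

typedef struct {
  CacheEntry entries[CACHE_SLOTS];
} SurfaceCache;

void
surface_cache_init (SurfaceCache * cache)
{
  size_t i;

  for (i = 0; i < CACHE_SLOTS; i++)
    cache->entries[i].in_use = 0;
}

/* Returns 0 on success, -1 if the cache is full. */
int
surface_cache_add (SurfaceCache * cache, unsigned int seq_num, int surface_id)
{
  size_t i;

  for (i = 0; i < CACHE_SLOTS; i++) {
    if (!cache->entries[i].in_use) {
      cache->entries[i].in_use = 1;
      cache->entries[i].seq_num = seq_num;
      cache->entries[i].surface_id = surface_id;
      return 0;
    }
  }
  return -1;
}

/* Returns the cached surface id for a rectangle, or -1 if not cached. */
int
surface_cache_lookup (const SurfaceCache * cache, unsigned int seq_num)
{
  size_t i;

  for (i = 0; i < CACHE_SLOTS; i++) {
    if (cache->entries[i].in_use && cache->entries[i].seq_num == seq_num)
      return cache->entries[i].surface_id;
  }
  return -1;
}

/* Drops all entries older than the composition's min_seq_num_used and
 * returns how many were expired. A real renderer would free the
 * associated surfaces here, in its own thread/context. */
size_t
surface_cache_expire (SurfaceCache * cache, unsigned int min_seq_num_used)
{
  size_t i, expired = 0;

  for (i = 0; i < CACHE_SLOTS; i++) {
    if (cache->entries[i].in_use
        && cache->entries[i].seq_num < min_seq_num_used) {
      cache->entries[i].in_use = 0;
      expired++;
    }
  }
  return expired;
}
```

The sink would call the expire function with min_seq_num_used from each
incoming composition, then look up each rectangle by seq_num and only
upload pixels on a cache miss.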
=== 5. Future considerations ===
- alternatives: there may be multiple versions/variants of the same subtitle
stream. On DVDs, there may be a 4:3 version and a 16:9 version of the same
subtitles. We could attach both variants and let the renderer pick the best
one for the situation (currently we just use the 16:9 version). With totem,
it's ultimately totem that adds the 'black bars' at the top/bottom, so totem
also knows if it's got a 4:3 display and can/wants to fit 4:3 subs (which
may render on top of the bars) or not, for example.
=== 6. Misc. FIXMEs ===
TEST: should these look (roughly) alike (note text distortion) - needs fixing in textoverlay
gst-launch-1.0 \
videotestsrc ! video/x-raw,width=640,height=480,pixel-aspect-ratio=1/1 ! textoverlay text=Hello font-desc=72 ! xvimagesink \
videotestsrc ! video/x-raw,width=320,height=480,pixel-aspect-ratio=2/1 ! textoverlay text=Hello font-desc=72 ! xvimagesink \
videotestsrc ! video/x-raw,width=640,height=240,pixel-aspect-ratio=1/2 ! textoverlay text=Hello font-desc=72 ! xvimagesink
~~~ THE END ~~~