gstreamer/docs/design/draft-subtitle-overlays.txt

===============================================================
 Subtitle overlays, hardware-accelerated decoding and playbin2
===============================================================

Status: EARLY DRAFT / BRAINSTORMING

The following text will use "playbin" synonymous with "playbin2".

 === 1. Background ===

Subtitles can be muxed in containers or come from an external source.

Subtitles come in many shapes and colours. Usually they are either
text-based (incl. 'pango markup'), or bitmap-based (e.g. DVD subtitles
and the most common form of DVB subs). Bitmap based subtitles are
usually compressed in some way, like some form of run-length encoding.

Subtitles are currently decoded and rendered in subtitle-format-specific
overlay elements. These elements have two sink pads (one for raw video
and one for the subtitle format in question) and one raw video source pad.

They will take care of synchronising the two input streams, and of
decoding and rendering the subtitles on top of the raw video stream.

Digression: one could theoretically have dedicated decoder/render elements
that output an AYUV or ARGB image, and then let a videomixer element do
the actual overlaying, but this is not very efficient, because it requires
us to allocate and blend whole pictures (1920x1080 AYUV = 8MB,
1280x720 AYUV = 3.6MB, 720x576 AYUV = 1.6MB) even if the overlay region
is only a small rectangle at the bottom. This wastes memory and CPU.
We could do something better by introducing a new format that only
encodes the region(s) of interest, but we don't have such a format yet, and
are not necessarily keen to rewrite this part of the logic in playbin2
at this point - and we can't change existing elements' behaviour, so would
need to introduce new elements for this.

Playbin2 supports outputting compressed formats, i.e. it does not
force decoding to a raw format, but is happy to output to a non-raw
format as long as the sink supports that as well.

In case of certain hardware-accelerated decoding APIs, we will make use
of that functionality. However, the decoder will not output a raw video
format then, but some kind of hardware/API-specific format (in the caps)
and the buffers will reference hardware/API-specific objects that
the hardware/API-specific sink will know how to handle.


 === 2. The Problem ===

In the case of such hardware-accelerated decoding, the decoder will not
output raw pixels that can easily be manipulated. Instead, it will
output hardware/API-specific objects that can later be used to render
a frame using the same API.

Even if we could transform such a buffer into raw pixels, we most
likely would want to avoid that, in order to avoid the need to
map the data back into system memory (and then later back to the GPU).
It's much better to upload the much smaller encoded data to the GPU/DSP
and then leave it there until rendered.

Currently playbin2 only supports subtitles on top of raw decoded video.
It will try to find a suitable overlay element from the plugin registry
based on the input subtitle caps and the rank. (It is assumed that we
will be able to convert any raw video format into any format required
by the overlay using a converter such as videoconvert.)

It will not render subtitles if the video sent to the sink is not
raw YUV or RGB or if conversions have been disabled by setting the
native-video flag on playbin2.

Subtitle rendering is considered an important feature. Enabling
hardware-accelerated decoding by default should not lead to a major
feature regression in this area.

This means that we need to support subtitle rendering on top of
non-raw video.


 === 3. Possible Solutions ===

The goal is to keep knowledge of the subtitle format within the
format-specific GStreamer plugins, and knowledge of any specific
video acceleration API to the GStreamer plugins implementing
that API. We do not want to make the pango/dvbsuboverlay/dvdspu/kate
plugins link to libva/libvdpau/etc. and we do not want to make
the vaapi/vdpau plugins link to all of libpango/libkate/libass etc.


Multiple possible solutions come to mind:

  (a) backend-specific overlay elements

      e.g. vaapitextoverlay, vdpautextoverlay, vaapidvdspu, vdpaudvdspu,
      vaapidvbsuboverlay, vdpaudvbsuboverlay, etc.

      This assumes the overlay can be done directly on the backend-specific
      object passed around.

      The main drawback with this solution is that it leads to a lot of
      code duplication and may also lead to uncertainty about distributing
      certain duplicated pieces of code. The code duplication is pretty
      much unavoidable, since making textoverlay, dvbsuboverlay, dvdspu,
      kate, assrender, etc. available in form of base classes to derive
      from is not really an option. Similarly, one would not really want
      the vaapi/vdpau plugin to depend on a bunch of other libraries
      such as libpango, libkate, libtiger, libass, etc.

      One could add some new kind of overlay plugin feature though in
      combination with a generic base class of some sort, but in order
      to accommodate all the different cases and formats one would end
      up with quite convoluted/tricky API.

      (Of course there could also be a GstFancyVideoBuffer that provides
      an abstraction for such video accelerated objects and that could
      provide an API to add overlays to it in a generic way, but in the
      end this is just a less generic variant of (c), and it is not clear
      that there are real benefits to a specialised solution vs. a more
      generic one).


  (b) convert backend-specific object to raw pixels and then overlay

      Even where possible technically, this is most likely very
      inefficient.


  (c) attach the overlay data to the backend-specific video frame buffers
      in a generic way and do the actual overlaying/blitting later in
      backend-specific code such as the video sink (or an accelerated
      encoder/transcoder)

      In this case, the actual overlay rendering (i.e. the actual text
      rendering or decoding DVD/DVB data into pixels) is done in the
      subtitle-format-specific GStreamer plugin. All knowledge about
      the subtitle format is contained in the overlay plugin then,
      and all knowledge about the video backend in the video backend
      specific plugin.

      The main question then is how to get the overlay pixels (and
      we will only deal with pixels here) from the overlay element
      to the video sink.

      This could be done in multiple ways: One could send custom
      events downstream with the overlay data, or one could attach
      the overlay data directly to the video buffers in some way.

      Sending inline events has the advantage that is is fairly
      transparent to any elements between the overlay element and
      the video sink: if an effects plugin creates a new video
      buffer for the output, nothing special needs to be done to
      maintain the subtitle overlay information, since the overlay
      data is not attached to the buffer. However, it slightly
      complicates things at the sink, since it would also need to
      look for the new event in question instead of just processing
      everything in its buffer render function.

      If one attaches the overlay data to the buffer directly, any
      element between overlay and video sink that creates a new
      video buffer would need to be aware of the overlay data
      attached to it and copy it over to the newly-created buffer.

      One would have to do implement a special kind of new query
      (e.g. FEATURE query) that is not passed on automatically by
      gst_pad_query_default() in order to make sure that all elements
      downstream will handle the attached overlay data. (This is only
      a problem if we want to also attach overlay data to raw video
      pixel buffers; for new non-raw types we can just make it
      mandatory and assume support and be done with it; for existing
      non-raw types nothing changes anyway if subtitles don't work)
      (we need to maintain backwards compatibility for existing raw
      video pipelines like e.g.:  ..decoder ! suboverlay ! encoder..)

      Even though slightly more work, attaching the overlay information
      to buffers seems more intuitive than sending it interleaved as
      events. And buffers stored or passed around (e.g. via the
      "last-buffer" property in the sink when doing screenshots via
      playbin2) always contain all the information needed.


  (d) create a video/x-raw-*-delta format and use a backend-specific videomixer

      This possibility was hinted at already in the digression in
      section 1. It would satisfy the goal of keeping subtitle format
      knowledge in the subtitle plugins and video backend knowledge
      in the video backend plugin. It would also add a concept that
      might be generally useful (think ximagesrc capture with xdamage).
      However, it would require adding foorender variants of all the
      existing overlay elements, and changing playbin2 to that new
      design, which is somewhat intrusive. And given the general
      nature of such a new format/API, we would need to take a lot
      of care to be able to accommodate all possible use cases when
      designing the API, which makes it considerably more ambitious.
      Lastly, we would need to write videomixer variants for the
      various accelerated video backends as well.


Overall (c) appears to be the most promising solution. It is the least
intrusive and should be fairly straight-forward to implement with
reasonable effort, requiring only small changes to existing elements
and requiring no new elements.

Doing the final overlaying in the sink as opposed to a videomixer
or overlay in the middle of the pipeline has other advantages:

 - if video frames need to be dropped, e.g. for QoS reasons,
   we could also skip the actual subtitle overlaying and
   possibly the decoding/rendering as well, if the
   implementation and API allows for that to be delayed.

 - the sink often knows the actual size of the window/surface/screen
   the output video is rendered to. This *may* make it possible to
   render the overlay image in a higher resolution than the input
   video, solving a long standing issue with pixelated subtitles on
   top of low-resolution videos that are then scaled up in the sink.
   This would require for the rendering to be delayed of course instead
   of just attaching an AYUV/ARGB/RGBA blog of pixels to the video buffer
   in the overlay, but that could all be supported.

 - if the video backend / sink has support for high-quality text
   rendering (clutter?) we could just pass the text or pango markup
   to the sink and let it do the rest (this is unlikely to be
   supported in the general case - text and glyph rendering is
   hard; also, we don't really want to make up our own text markup
   system, and pango markup is probably too limited for complex
   karaoke stuff).


 === 4. API needed ===

  (a) Representation of subtitle overlays to be rendered

      We need to pass the overlay pixels from the overlay element to the
      sink somehow. Whatever the exact mechanism, let's assume we pass
      a refcounted GstVideoOverlayComposition struct or object.

      A composition is made up of one or more overlays/rectangles.

      In the simplest case an overlay rectangle is just a blob of
      RGBA/ABGR [FIXME?] or AYUV pixels with positioning info and other
      metadata, and there is only one rectangle to render.

      We're keeping the naming generic ("OverlayFoo" rather than
      "SubtitleFoo") here, since this might also be handy for
      other use cases such as e.g. logo overlays or so. It is not
      designed for full-fledged video stream mixing though.

        // Note: don't mind the exact implementation details, they'll be hidden

        // FIXME: might be confusing in 0.11 though since GstXOverlay was
        //        renamed to GstVideoOverlay in 0.11, but not much we can do,
        //        maybe we can rename GstVideoOverlay to something better

        struct GstVideoOverlayComposition
        {
            guint                          num_rectangles;
            GstVideoOverlayRectangle    ** rectangles;

            /* lowest rectangle sequence number still used by the upstream
             * overlay element. This way a renderer maintaining some kind of
             * rectangles <-> surface cache can know when to free cached
             * surfaces/rectangles. */
            guint                          min_seq_num_used;

            /* sequence number for the composition (same series as rectangles) */
            guint                          seq_num;
        }

        struct GstVideoOverlayRectangle
        {
            /* Position on video frame and dimension of output rectangle in
             * output frame terms (already adjusted for the PAR of the output
             * frame). x/y can be negative (overlay will be clipped then) */
            gint  x, y;
            guint render_width, render_height;

            /* Dimensions of overlay pixels */
            guint width, height, stride;

            /* This is the PAR of the overlay pixels */
            guint par_n, par_d;

            /* Format of pixels, GST_VIDEO_FORMAT_ARGB on big-endian systems,
             * and BGRA on little-endian systems (i.e. pixels are treated as
             * 32-bit values and alpha is always in the most-significant byte,
             * and blue is in the least-significant byte).
             *
             * FIXME: does anyone actually use AYUV in practice? (we do
             * in our utility function to blend on top of raw video)
             * What about AYUV and endianness? Do we always have [A][Y][U][V]
             * in memory? */
            /* FIXME: maybe use our own enum? */
            GstVideoFormat format;

            /* Refcounted blob of memory, no caps or timestamps */
            GstBuffer *pixels;

            // FIXME: how to express source like text or pango markup?
            //        (just add source type enum + source buffer with data)
            //
            // FOR 0.10: always send pixel blobs, but attach source data in
            // addition (reason: if downstream changes, we can't renegotiate
            // that properly, if we just do a query of supported formats from
            // the start). Sink will just ignore pixels and use pango markup
            // from source data if it supports that.
            //
            // FOR 0.11: overlay should query formats (pango markup, pixels)
            // supported by downstream and then only send that. We can
            // renegotiate via the reconfigure event.
            //

            /* sequence number: useful for backends/renderers/sinks that want
             * to maintain a cache of rectangles <-> surfaces. The value of
             * the min_seq_num_used in the composition tells the renderer which
             * rectangles have expired. */
            guint      seq_num;

            /* FIXME: we also need a (private) way to cache converted/scaled
             * pixel blobs */
        }

      (a1) Overlay consumer API:

        How would this work in a video sink that supports scaling of textures:

        gst_foo_sink_render () {
          /* assume only one for now */
          if video_buffer has composition:
            composition = video_buffer.get_composition()

            for each rectangle in composition:
              if rectangle.source_data_type == PANGO_MARKUP
                actor = text_from_pango_markup (rectangle.get_source_data())
              else
                pixels = rectangle.get_pixels_unscaled (FORMAT_RGBA, ...)
                actor = texture_from_rgba (pixels, ...)

              .. position + scale on top of video surface ...
        }

      (a2) Overlay producer API:

        e.g. logo or subpicture overlay: got pixels, stuff into rectangle:

         if (logoverlay->cached_composition == NULL) {
           comp = composition_new ();

           rect = rectangle_new (format, pixels_buf,
                                 width, height, stride, par_n, par_d,
                                 x, y, render_width, render_height);

           /* composition adds its own ref for the rectangle */
           composition_add_rectangle (comp, rect);
           rectangle_unref (rect);

           /* buffer adds its own ref for the composition */
           video_buffer_attach_composition (comp);

           /* we take ownership of the composition and save it for later */
           logoverlay->cached_composition = comp;
         } else {
           video_buffer_attach_composition (logoverlay->cached_composition);
         }

      FIXME: also add some API to modify render position/dimensions of
      a rectangle (probably requires creation of new rectangle, unless
      we handle writability like with other mini objects).

  (b) Fallback overlay rendering/blitting on top of raw video

      Eventually we want to use this overlay mechanism not only for
      hardware-accelerated video, but also for plain old raw video,
      either at the sink or in the overlay element directly.

      Apart from the advantages listed earlier in section 3, this
      allows us to consolidate a lot of overlaying/blitting code that
      is currently repeated in every single overlay element in one
      location. This makes it considerably easier to support a whole
      range of raw video formats out of the box, add SIMD-optimised
      rendering using ORC, or handle corner cases correctly.

      (Note: side-effect of overlaying raw video at the video sink is
      that if e.g. a screnshotter gets the last buffer via the last-buffer
      property of basesink, it would get an image without the subtitles
      on top. This could probably be fixed by re-implementing the
      property in GstVideoSink though. Playbin2 could handle this
      internally as well).

        void
        gst_video_overlay_composition_blend (GstVideoOverlayComposition * comp
                                             GstBuffer                  * video_buf)
        {
          guint n;

          g_return_if_fail (gst_buffer_is_writable (video_buf));
          g_return_if_fail (GST_BUFFER_CAPS (video_buf) != NULL);

          ... parse video_buffer caps into BlendVideoFormatInfo ...

          for each rectangle in the composition: {

                 if (gst_video_format_is_yuv (video_buf_format)) {
                   overlay_format = FORMAT_AYUV;
                 } else if (gst_video_format_is_rgb (video_buf_format)) {
                   overlay_format = FORMAT_ARGB;
                 } else {
                   /* FIXME: grayscale? */
                   return;
                 }

                 /* this will scale and convert AYUV<->ARGB if needed */
                 pixels = rectangle_get_pixels_scaled (rectangle, overlay_format);

                 ... clip output rectangle ...

                 __do_blend (video_buf_format, video_buf->data,
                             overlay_format, pixels->data,
                             x, y, width, height, stride);

                 gst_buffer_unref (pixels);
          }
        }


  (c) Flatten all rectangles in a composition

      We cannot assume that the video backend API can handle any
      number of rectangle overlays, it's possible that it only
      supports one single overlay, in which case we need to squash
      all rectangles into one.

      However, we'll just declare this a corner case for now, and
      implement it only if someone actually needs it. It's easy
      to add later API-wise. Might be a bit tricky if we have
      rectangles with different PARs/formats (e.g. subs and a logo),
      though we could probably always just use the code from (b)
      with a fully transparent video buffer to create a flattened
      overlay buffer.

  (d) core API: new FEATURE query

      For 0.10 we need to add a FEATURE query, so the overlay element
      can query whether the sink downstream and all elements between
      the overlay element and the sink support the new overlay API.
      Elements in between need to support it because the render
      positions and dimensions need to be updated if the video is
      cropped or rescaled, for example.

      In order to ensure that all elements support the new API,
      we need to drop the query in the pad default query handler
      (so it only succeeds if all elements handle it explicitly).

      Might want two variants of the feature query - one where
      all elements in the chain need to support it explicitly
      and one where it's enough if some element downstream
      supports it.

      In 0.11 this could probably be handled via GstMeta and
      ALLOCATION queries (and/or we could simply require
      elements to be aware of this API from the start).

      There appears to be no issue with downstream possibly
      not being linked yet at the time when an overlay would
      want to do such a query.


Other considerations:

 - renderers (overlays or sinks) may be able to handle only ARGB or only AYUV
   (for most graphics/hw-API it's likely ARGB of some sort, while our
   blending utility functions will likely want the same colour space as
   the underlying raw video format, which is usually YUV of some sort).
   We need to convert where required, and should cache the conversion.

 - renderers may or may not be able to scale the overlay. We need to
   do the scaling internally if not (simple case: just horizontal scaling
   to adjust for PAR differences; complex case: both horizontal and vertical
   scaling, e.g. if subs come from a different source than the video or the
   video has been rescaled or cropped between overlay element and sink).

 - renderers may be able to generate (possibly scaled) pixels on demand
   from the original data (e.g. a string or RLE-encoded data). We will
   ignore this for now, since this functionality can still be added later
   via API additions. The most interesting case would be to pass a pango
   markup string, since e.g. clutter can handle that natively.

 - renderers may be able to write data directly on top of the video pixels
   (instead of creating an intermediary buffer with the overlay which is
   then blended on top of the actual video frame), e.g. dvdspu, dvbsuboverlay

   However, in the interest of simplicity, we should probably ignore the
   fact that some elements can blend their overlays directly on top of the
   video (decoding/uncompressing them on the fly), even more so as it's
   not obvious that it's actually faster to decode the same overlay
   70-90 times (say) (ie. ca. 3 seconds of video frames) and then blend
   it 70-90 times instead of decoding it once into a temporary buffer
   and then blending it directly from there, possibly SIMD-accelerated.
   Also, this is only relevant if the video is raw video and not some
   hardware-acceleration backend object.

   And ultimately it is the overlay element that decides whether to do
   the overlay right there and then or have the sink do it (if supported).
   It could decide to keep doing the overlay itself for raw video and
   only use our new API for non-raw video.

 - renderers may want to make sure they only upload the overlay pixels once
   per rectangle if that rectangle recurs in subsequent frames (as part of
   the same composition or a different composition), as is likely. This caching
   of e.g. surfaces needs to be done renderer-side and can be accomplished
   based on the sequence numbers. The composition contains the lowest
   sequence number still in use upstream (an overlay element may want to
   cache created compositions+rectangles as well after all to re-use them
   for multiple frames), based on that the renderer can expire cached
   objects. The caching needs to be done renderer-side because attaching
   renderer-specific objects to the rectangles won't work well given the
   refcounted nature of rectangles and compositions, making it unpredictable
   when a rectangle or composition will be freed or from which thread
   context it will be freed. The renderer-specific objects are likely bound
   to other types of renderer-specific contexts, and need to be managed
   in connection with those.

 - composition/rectangles should internally provide a certain degree of
   thread-safety. Multiple elements (sinks, overlay element) might access
   or use the same objects from multiple threads at the same time, and it
   is expected that elements will keep a ref to compositions and rectangles
   they push downstream for a while, e.g. until the current subtitle
   composition expires.

 === 5. Future considerations ===

 - alternatives: there may be multiple versions/variants of the same subtitle
   stream. On DVDs, there may be a 4:3 version and a 16:9 version of the same
   subtitles. We could attach both variants and let the renderer pick the best
   one  for the situation (currently we just use the 16:9 version). With totem,
   it's ultimately totem that adds the 'black bars' at the top/bottom, so totem
   also knows if it's got a 4:3 display and can/wants to fit 4:3 subs (which
   may render on top of the bars) or not, for example.

 === 6. Misc. FIXMEs ===

TEST: should these look (roughly) alike (note text distortion) - needs fixing in textoverlay

gst-launch-0.10 \
    videotestsrc ! video/x-raw,width=640,height=480,pixel-aspect-ratio=1/1 ! textoverlay text=Hello font-desc=72 ! xvimagesink \
    videotestsrc ! video/x-raw,width=320,height=480,pixel-aspect-ratio=2/1 ! textoverlay text=Hello font-desc=72 ! xvimagesink \
    videotestsrc ! video/x-raw,width=640,height=240,pixel-aspect-ratio=1/2 ! textoverlay text=Hello font-desc=72 ! xvimagesink

 ~~~ THE END ~~~