gstreamer/docs/design/draft-hw-acceleration.txt

Hardware Acceleration in GStreamer 1.0
--------------------------------------

Status : DRAFT


Preamble:

  This document serves to identify and define the various usages of
  hardware-acceleration (hereafter hwaccel) in GStreamer 1.0, the
  problems that arise and need to be solved, and a proposed API.


Out of scope:

  This document will initially limit itself to usage of hwaccel in the
  field of video capture, processing and display due to their
  complexity.
  It is not excluded that some parts of the research could be
  applicable to other fields (audio, text, generic media).

  This document will not cover how encoded data is parsed and
  fed/obtained to/from the various hardware subsystems.


Overall Goal:

  Make the most of the underlying hardware features while at the same
  time not introducing any noticeable overhead [0] and providing the
  greatest possible flexibility of use-cases.


Secondary Goals:

  Avoid providing a system that only allows (efficient) usage of one
  use-case and/or through a specific combination of elements. This is
  contrary to the principles of GStreamer.

  Not introduce any unneeded memory copies.

  Not introduce any extra latency.

  Process data asynchronously wherever possible.


Terminology:

  Due to the limitations of the GStreamer 0.10 API, most
  hardware-accelerated elements, especially sink elements, were named
  "non-raw video elements".
  In the rest of this document, we will no longer refer to them as
  non-raw since they _do_ handle raw video and in GStreamer 1.0 it no
  longer matters where the raw video is located or accessed. We will
  prefer the term "hardware-accelerated video element".


Specificities:

  Hardware-accelerated elements differ from non-hwaccel elements in a
  few ways:

  * They handle memory which, in the vast majority of cases, is
    not accessible directly.
  * The processing _can_ happen asynchronously.
  * They _might_ be part of a GPU sub-system and therefore tightly
    coupled to the display system.


Features handled:

  HW-accelerated elements can handle a variety of individual logical
  features. These should, in the spirit of GStreamer, be controllable
  in an individual fashion.

  * Video decoding and encoding
  * Display
  * Capture
  * Scaling (Downscaling (preview), Upscaling (Super-resolution))
  * Deinterlacing (including inverse-telecine)
  * Post-processing (Noise reduction, ...)
  * Colorspace conversion
  * Overlaying and compositing


Use-cases:
----------

UC1 : HW-accelerated video decoding to counterpart sink

  Example : * VDPAU decoder to VDPAU sink
            * libVA decoder to libVA sink

  In these situations, the HW-accelerated decoder and sink can use the
  same API to communicate with each other and share data.

  There might be extra processing that can be applied before display
  (deinterlacing, noise reduction, overlaying, ...) and that is
  provided by the backing hardware. All these features should be
  usable in a transparent fashion from GStreamer.

  They might also need to communicate/share a common context.


UC2 : HW-accelerated video decoding to different hwaccel sink

  Example : * VDPAU/libVA decoder to OpenGL-based sink

  The goal here is to end up with the decoded pictures as OpenGL
  textures, which can then be used in an OpenGL scene (with all the
  transformations one can do with those textures).

  GStreamer is responsible for:
  1) Filling the contents of those textures
  2) Informing the application when to use which texture at which time
    (i.e. synchronization).

  How the textures are used is not the responsibility of GStreamer,
  although a fallback could be possible (displaying the texture in a
  specified X window, for example) if the application does not handle
  the OpenGL scene.

  Efficient usage is only possible if the HW-accelerated system
  provides an API by which one can either:
  * Be given OpenGL texture IDs for the decoder to decode into
  * OR 'transform' hwaccel-backed buffers into texture IDs

  Just as for UC1, some information will need to be exchanged between
  the OpenGL-backed elements and the other HW-accelerated element.


UC3 : HW-accelerated decoding to HW-accelerated encoding

  This is needed in cases where we want to re-encode a stream from one
  format/profile to another format/profile, for example on
  UPnP/DLNA embedded devices.

  If the encoder and decoder are using the same backing hardware, this
  is similar to UC1.

  If the encoder and decoder are backed by 1) different hardware but
  there is an API allowing communication between the two, OR 2) the
  same hardware but through different APIs, this is similar to UC2.

  If the hardware backing the encoder and decoder has no direct
  means of communication, then best effort must be made to introduce
  only one copy. The recent ongoing improvements in the kernel
  regarding DMA usage could help in that regard, allowing one piece
  of hardware to be aware of another.


UC4 : HW-accelerated decoding to software plugin

  Examples : * Transcoding a stream using a software encoder
             * Applying measurement/transformations
             * Your crazy idea here
             * ...

  While the most common usage of HW-accelerated decoding is for
  display, we do not want to limit users of the GStreamer framework to
  only be able to use those plugins in some limited use-cases. Users
  should be able to benefit from the acceleration in any use-case.


UC5 : Software element to HW-accelerated display

  Examples : * Software decoder to VA/VDPAU/GL/.. sink
             * Visualization to VA/VDPAU/GL/... sink
             * anything in fact

  We need to ensure in these cases that any GStreamer plugin can
  output data to a HW-accelerated display.

  This process must not introduce any unwanted synchronization issues,
  meaning the transfer to the backing hardware needs to happen before
  the synchronization time in the sinks.


UC6 : HW-accelerated capture to HW-accelerated encoder

  Examples : * Camerabin usage
             * Streaming server
             * Video-over-IP
             * ...

  In order to provide not only low CPU usage (through HW-accelerated
  encoding) but also low latency, we need to be able to have capture
  hardware provide the data to be encoded in such a way that the
  encoder can read it without any copy.

  Some capture APIs provide means by which the hardware can be
  provided with a pool of buffers backed by some MMAP contiguous
  memory.


UC6.1 : UC6 + simultaneous preview

  Examples : Camerabin usage (preview of video/photo while shooting)


Problems:
---------

P1 : Ranking of decoders

  How do we pick the best decoder available ? Do we just set the
  ranking of hardware-accelerated plugins to higher ranks ?


P2 : Capabilities of HW-accelerated decoders

  Hardware decoders can have much tighter constraints as to what they
  can handle (limitations in sizes, bitrate, profile, level,
  ...).

  These limitations might be known without probing the hardware, but
  in most cases they require querying it.
  As much information as possible about the stream to decode is
  needed. It can be obtained through parsers, with the search for a
  decoder starting only once the parser has provided extensive caps.
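
  The selection idea from P1 and P2 can be sketched as follows. This
  is only an illustration in plain C, not GStreamer API: DecoderInfo,
  StreamInfo and pick_decoder() are hypothetical names, and the
  constraint fields (maximum size and level) are assumed examples of
  what a parser could expose through extensive caps. The sketch
  filters out decoders whose limits the stream exceeds, then keeps
  the highest-ranked survivor.

```c
#include <stddef.h>

/* Hypothetical description of one decoder and its hardware limits. */
typedef struct {
  const char *name;
  unsigned int rank;        /* higher is preferred */
  unsigned int max_width;
  unsigned int max_height;
  unsigned int max_level;   /* codec level, e.g. 41 for level 4.1 */
} DecoderInfo;

/* Hypothetical stream properties, as a parser might report them. */
typedef struct {
  unsigned int width;
  unsigned int height;
  unsigned int level;
} StreamInfo;

/* Return the highest-ranked decoder able to handle the stream,
 * or NULL if none of them can. */
const DecoderInfo *
pick_decoder (const DecoderInfo * decoders, size_t n,
    const StreamInfo * stream)
{
  const DecoderInfo *best = NULL;

  for (size_t i = 0; i < n; i++) {
    const DecoderInfo *d = &decoders[i];

    /* Skip decoders whose constraints the stream exceeds. */
    if (stream->width > d->max_width ||
        stream->height > d->max_height ||
        stream->level > d->max_level)
      continue;

    if (best == NULL || d->rank > best->rank)
      best = d;
  }
  return best;
}
```

  With a high-ranked hardware decoder limited to 1920x1088 and a
  lower-ranked software decoder without that limit, a 1080p stream
  would select the hardware decoder while a 2160p stream would fall
  back to the software one.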


P3 : Finding and auto-plugging the best elements

  Taking the case where several decoders are available and several
  sink elements are available, how do we establish which is the best
  combination ?

  Assuming we take the highest-ranked (and compatible) decoder, how do
  we figure out which sink element is compatible ?

  Assuming the user/application selects a specific sink, how do we
  figure out which is the best decoder to use ?

  /!\ Caps are no longer sufficient to establish compatibility


P4 : How to handle systems that require calls to happen in one thread

  In OpenGL (for example) calls can only be done from one thread,
  which might not be a GStreamer thread (the sink could be controlled
  from an application thread).

  How do we properly (and safely) handle buffers and contexts ? Do we
  create an API that allows marshalling processing into the proper
  thread (resulting in an asynchronous API from the GStreamer point of
  view) ?


Proposed Design:
----------------

D1 : GstCaps

  We use the "video/x-raw" GstCaps.

  The format field and other required fields are filled in the same
  way they would be for non-HW-accelerated streams.


D2 : Buffers and memory access

  The buffers used/provided/consumed by the various HW-accelerated
  elements must be usable with non-HW-accelerated elements.

  To that end, the GstMemory backing the various buffers must be
  accessible via the mapping methods and therefore have the proper
  GstAllocator implementation if so required.

  In the unlikely event that the hardware does not provide any means
  to map the memory, or that there are limitations preventing it
  (such as on DRM systems), there should still be an implementation
  of GstMemoryMapFunction that returns NULL (and a size/maxsize of
  zero) when called.
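
  The contract described above can be modelled as follows. This is a
  simplified sketch in plain C: HwMemory, HwMemoryMapFunc and the
  functions below are hypothetical stand-ins and do not mirror the
  real GstMemory or GstAllocator signatures. It shows a map
  implementation that always fails for protected memory, and a
  software consumer that checks the result instead of assuming the
  data is reachable.

```c
#include <stddef.h>

typedef struct HwMemory HwMemory;

/* Hypothetical map function: returns a CPU pointer and its size,
 * or NULL (and a size of zero) when the memory cannot be mapped. */
typedef void *(*HwMemoryMapFunc) (HwMemory * mem, size_t * size);

struct HwMemory {
  void *cpu_data;           /* NULL when not CPU-accessible */
  size_t cpu_size;
  HwMemoryMapFunc map;
};

/* Map implementation for memory the hardware never exposes. */
void *
hw_memory_map_protected (HwMemory * mem, size_t * size)
{
  (void) mem;
  *size = 0;
  return NULL;
}

/* Map implementation for memory that is directly accessible. */
void *
hw_memory_map_plain (HwMemory * mem, size_t * size)
{
  *size = mem->cpu_size;
  return mem->cpu_data;
}

/* Software consumer: sums all bytes, or returns -1 if the memory
 * could not be mapped. */
long
sum_bytes (HwMemory * mem)
{
  size_t size;
  const unsigned char *data = mem->map (mem, &size);
  long sum = 0;

  if (data == NULL)
    return -1;

  for (size_t i = 0; i < size; i++)
    sum += data[i];
  return sum;
}
```

  The point of the design choice is that a non-HW-accelerated element
  always has a map function to call and gets a well-defined failure,
  rather than having to know which kind of memory it was handed.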


D3 : GstVideoMeta

  In the same way that a custom GstAllocator is required, it is
  important that elements implement the proper GstVideoMeta API
  wherever applicable.

  The GstVideoMeta fields should correspond to the memory returned by
  a call to gst_buffer_map() and/or gst_video_meta_map().

  => gst_video_meta_{map|unmap}() needs to call the
     GstVideoMeta->{map|unmap} implementations


D4 : Custom GstMeta

  In order to pass along API and/or hardware-specific information
  regarding the various buffers, the elements will be able to create
  custom GstMeta.

  Ex (For VDPAU):

  struct _GstVDPAUMeta {
     GstMeta         meta;

     VdpDevice       device;
     VdpVideoSurface surface;
     ...
  };

  If an element supports multiple APIs for accessing/using the data
  (like for example VDPAU and GLX), it should attach all the
  applicable GstMeta.


D5 : Buffer pools

  In order to:
  * avoid expensive cycles of buffer destruction/creation,
  * allow upstream elements to end up with the optimal buffers/memory
    to which to upload,
  elements should implement GstBufferPools whenever possible.

  If the backing hardware has a system by which it differentiates used
  buffers and available buffers, the bufferpool should have the proper
  release_buffer() and acquire_buffer() implementations.
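
  The used/available bookkeeping mentioned above can be illustrated
  with a toy model. This is a sketch in plain C and not the
  GstBufferPool API: HwBuffer, HwBufferPool and the hw_pool_*()
  functions are hypothetical, and the fixed pool size is an
  assumption. Buffers are created once, then handed out and returned
  without any destruction/creation cycle.

```c
#include <stddef.h>
#include <stdbool.h>

#define POOL_SIZE 4

/* Hypothetical stand-in for a hardware surface. */
typedef struct {
  int surface_id;
  bool in_use;
} HwBuffer;

typedef struct {
  HwBuffer buffers[POOL_SIZE];
} HwBufferPool;

/* Create all surfaces up front and mark them available. */
void
hw_pool_init (HwBufferPool * pool)
{
  for (int i = 0; i < POOL_SIZE; i++) {
    pool->buffers[i].surface_id = i;
    pool->buffers[i].in_use = false;
  }
}

/* Hand out the first available buffer, or NULL if all are used. */
HwBuffer *
hw_pool_acquire (HwBufferPool * pool)
{
  for (int i = 0; i < POOL_SIZE; i++) {
    if (!pool->buffers[i].in_use) {
      pool->buffers[i].in_use = true;
      return &pool->buffers[i];
    }
  }
  return NULL;
}

/* Mark a buffer as available again; the surface itself is kept. */
void
hw_pool_release (HwBuffer * buf)
{
  buf->in_use = false;
}
```

  A real pool would typically block or report an error when no
  buffer is available; the sketch simply returns NULL.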


D6 : Ahead-of-time/asynchronous uploading

  In the case where the buffers to be displayed are not on the target
  hardware, we need to ensure the buffers are uploaded before the
  synchronization time. If data is uploaded at the render time we will
  end up with an unknown render latency, resulting in bad A/V
  synchronization.

  In order for this to happen, the buffers provided by downstream
  elements should have a GstAllocator implementation allowing
  uploading memory on _map(GST_MAP_WRITE).

  If this uploading happens asynchronously, the GstAllocator should
  implement a system so that if an intermediary element wishes to map
  the memory it can do so (either by providing a cached version of the
  memory, or by using locks).


D7 : Overlay and positioning support

  FIXME : Move to a separate design doc

  struct _GstVideoCompositingMeta {
    GstMeta               meta;

    /* zorder : Depth Position of the layer in the final scene
     *            0 = background
     *    2**32 - 1 = foreground
     */
    guint                 zorder;

    /* x,y    : Spatial position of the layer in the final scene
     */
    guint                 x;
    guint                 y;

    /* width/height : Target width/height of the layer in the
     *   final scene.
     */
    guint                 width;
    guint                 height;

    /* basewidth/baseheight : Reference scene width/height
     *   If both values are zero, the x/y/width/height values above
     *   are to be used as absolute coordinates, regardless of the
     *   final scene's width and height.
     *   If the values are non-zero, the x/y/width/height values
     *   above should be scaled based on those values.
     *     Ex : real x position = x / basewidth * scene_width
     */
    guint                 basewidth;
    guint                 baseheight;

    /* alpha : Global alpha multiplier
     *   0.0 = completely transparent
     *   1.0 = no modification of original transparency (or opacity)
     */
    gdouble               alpha;
  };
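
  The scaling rule from the basewidth/baseheight comment can be
  sketched as follows. This is a minimal illustration in plain C and
  not part of the proposal: LayerPosition, LayerRect, scale_coord()
  and layer_resolve() are hypothetical names, and treating the case
  where only one base value is zero as absolute is an assumption,
  since the draft only defines the cases where both are zero or both
  are non-zero. The multiplication is done before the division so
  that integer arithmetic does not truncate x / basewidth to zero.

```c
#include <stdint.h>

/* Hypothetical stand-in for the positioning fields of the proposed
 * compositing meta, using plain C types instead of GLib ones. */
typedef struct {
  unsigned int x, y;
  unsigned int width, height;
  unsigned int basewidth, baseheight;
} LayerPosition;

/* Resolved rectangle in real scene coordinates. */
typedef struct {
  unsigned int x, y;
  unsigned int width, height;
} LayerRect;

/* Scale one value from the reference scene to the real scene.
 * The product is computed in 64 bits to avoid overflow. */
unsigned int
scale_coord (unsigned int value, unsigned int base, unsigned int scene)
{
  return (unsigned int) (((uint64_t) value * scene) / base);
}

/* Compute where the layer lands in a scene of the given size. */
LayerRect
layer_resolve (const LayerPosition * pos, unsigned int scene_width,
    unsigned int scene_height)
{
  /* Start from absolute coordinates. */
  LayerRect r = { pos->x, pos->y, pos->width, pos->height };

  /* Only scale when a reference scene size was provided. */
  if (pos->basewidth != 0 && pos->baseheight != 0) {
    r.x = scale_coord (pos->x, pos->basewidth, scene_width);
    r.y = scale_coord (pos->y, pos->baseheight, scene_height);
    r.width = scale_coord (pos->width, pos->basewidth, scene_width);
    r.height = scale_coord (pos->height, pos->baseheight, scene_height);
  }
  return r;
}
```

  For example, a layer at x=160 with width=320 in a 640x360 reference
  scene resolves to x=480 and width=960 in a 1920x1080 scene.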


D8 : De-interlacing support

  FIXME : Move to a separate design doc

  For systems that can apply deinterlacing, the user needs to be in
  control of whether it should be applied or not.

  This should be done through the usage of the deinterlace element.

  In order to benefit from the HW-acceleration, downstream/upstream
  elements need a way by which they can indicate that the
  deinterlacing process will be applied later.

  To this end, we introduce a new GstMeta : GstDeinterlaceMeta

  typedef const gchar *GstDeinterlaceMethod;

  struct _GstDeinterlaceMeta {
    GstMeta              meta;

    GstDeinterlaceMethod method;
  };


D9 : Context sharing

  Re-use parts of -bad's videocontext ?


D10 : Non-MT-safe APIs

  If the wrapped API/system does not offer an API which is MT-safe
  and/or usable from more than one thread (like OpenGL), we need:
  * A system by which a global context can be provided to all elements
    wanting to use that system,
  * A system by which elements can serialize processing to a 3rd party
    thread.
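
  The second point, serializing processing to a third-party thread,
  can be sketched as a queue of calls that only the owning thread
  executes. This is an illustration in plain C under stated
  assumptions: TaskQueue, task_queue_push(), task_queue_drain() and
  count_call() are hypothetical names, the queue has a fixed
  capacity, and all locking is omitted, so a real implementation
  would need a mutex (and probably a condition variable) around the
  queue.

```c
#include <stddef.h>

#define MAX_TASKS 16

typedef void (*TaskFunc) (void *user_data);

typedef struct {
  TaskFunc func;
  void *user_data;
} Task;

/* Fixed-size FIFO of pending calls (no locking in this sketch). */
typedef struct {
  Task tasks[MAX_TASKS];
  size_t head;              /* index of the next task to run */
  size_t count;             /* number of queued tasks */
} TaskQueue;

void
task_queue_init (TaskQueue * q)
{
  q->head = 0;
  q->count = 0;
}

/* Called by any element: queue a call for the owning thread.
 * Returns 0 on success, -1 if the queue is full. */
int
task_queue_push (TaskQueue * q, TaskFunc func, void *user_data)
{
  size_t tail;

  if (q->count == MAX_TASKS)
    return -1;

  tail = (q->head + q->count) % MAX_TASKS;
  q->tasks[tail].func = func;
  q->tasks[tail].user_data = user_data;
  q->count++;
  return 0;
}

/* Called only from the owning thread: run every queued call in
 * FIFO order and return how many were run. */
size_t
task_queue_drain (TaskQueue * q)
{
  size_t ran = 0;

  while (q->count > 0) {
    Task t = q->tasks[q->head];

    q->head = (q->head + 1) % MAX_TASKS;
    q->count--;
    t.func (t.user_data);
    ran++;
  }
  return ran;
}

/* Example task standing in for a call into the wrapped API. */
void
count_call (void *user_data)
{
  int *counter = user_data;

  (*counter)++;
}
```

  Pushing returns before the call has run, which is what makes the
  resulting API asynchronous from the GStreamer point of view, as
  raised in P4.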


[0]: Defining "noticeable overhead" is always tricky, but it
essentially means that the overhead introduced by GStreamer core and
the element code should not exceed the overhead introduced for
non-hw-accelerated elements.