From b635c6914547f254722d2f8dd29b39cb5dff7615 Mon Sep 17 00:00:00 2001
From: Edward Hervey
Date: Tue, 17 Jan 2012 17:17:24 +0100
Subject: [PATCH] design: First go at hardware-acceleration design doc

---
 docs/design/draft-hw-acceleration.txt | 427 ++++++++++++++++++++++++++
 1 file changed, 427 insertions(+)
 create mode 100644 docs/design/draft-hw-acceleration.txt

diff --git a/docs/design/draft-hw-acceleration.txt b/docs/design/draft-hw-acceleration.txt
new file mode 100644
index 0000000000..42d53720c6
--- /dev/null
+++ b/docs/design/draft-hw-acceleration.txt
@@ -0,0 +1,427 @@
+Hardware Acceleration in GStreamer 1.0
+--------------------------------------
+
+Status : DRAFT
+
+
+Preamble:
+
+  This document serves to identify and define the various usages of
+  hardware acceleration (hereafter hwaccel) in GStreamer 1.0, the
+  problems that arise and need to be solved, and a proposed API.
+
+
+Out of scope:
+
+  This document will initially limit itself to the usage of hwaccel
+  in the field of video capture, processing and display, due to their
+  complexity.
+  It is not excluded that some parts of the research could be
+  applicable to other fields (audio, text, generic media).
+
+  This document will not cover how encoded data is parsed and
+  fed/obtained to/from the various hardware subsystems.
+
+
+Overall Goal:
+
+  Make the most of the underlying hardware features while at the same
+  time not introducing any noticeable overhead [0] and providing the
+  greatest flexibility of use-cases possible.
+
+
+Secondary Goals:
+
+  Avoid providing a system that only allows (efficient) usage of one
+  use-case and/or through a specific combination of elements. This is
+  contrary to the principles of GStreamer.
+
+  Not introduce any unneeded memory copies.
+
+  Not introduce any extra latency.
+
+  Process data asynchronously wherever possible.
+
+
+Terminology:
+
+  Due to the limitations of the GStreamer 0.10 API, most of these
+  elements, especially sink elements, were named "non-raw video
+  elements".
+  In the rest of this document, we will no longer refer to them as
+  non-raw, since they _do_ handle raw video and, in GStreamer 1.0, it
+  no longer matters where the raw video is located or accessed. We
+  will prefer the term "hardware-accelerated video element".
+
+
+Specificities:
+
+  Hardware-accelerated elements differ from non-hwaccel elements in a
+  few ways:
+
+  * They handle memory which, in the vast majority of cases, is not
+    accessible directly.
+  * The processing _can_ happen asynchronously.
+  * They _might_ be part of a GPU sub-system and therefore be tightly
+    coupled to the display system.
+
+
+Features handled:
+
+  HW-accelerated elements can handle a variety of individual logical
+  features. These should, in the spirit of GStreamer, be controllable
+  in an individual fashion.
+
+  * Video decoding and encoding
+  * Display
+  * Capture
+  * Scaling (Downscaling (preview), Upscaling (Super-resolution))
+  * Deinterlacing (including inverse-telecine)
+  * Post-processing (Noise reduction, ...)
+  * Colorspace conversion
+  * Overlaying and compositing
+
+
+Use-cases:
+----------
+
+UC1 : HW-accelerated video decoding to counterpart sink
+
+  Example : * VDPAU decoder to VDPAU sink
+            * libVA decoder to libVA sink
+
+  In these situations, the HW-accelerated decoder and sink can use the
+  same API to communicate with each other and share data.
+
+  There might be extra processing that can be applied before display
+  (deinterlacing, noise reduction, overlaying, ...) and that is
+  provided by the backing hardware. All these features should be
+  usable in a transparent fashion from GStreamer.
+
+  They might also need to communicate/share a common context.
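+
+  As a purely illustrative sketch (the element names vdpaumpegdec and
+  vdpausink are placeholders for whatever hwaccel elements a platform
+  provides), such a pipeline should be usable from an application like
+  any other GStreamer pipeline:
+
+    #include <gst/gst.h>
+
+    int
+    main (int argc, char **argv)
+    {
+      GstElement *pipeline;
+      GstBus *bus;
+      GstMessage *msg;
+      GError *error = NULL;
+
+      gst_init (&argc, &argv);
+
+      /* The decoder and sink share the same device/context behind the
+       * scenes; from the application's point of view this is a
+       * regular pipeline. */
+      pipeline = gst_parse_launch ("filesrc location=test.mpg "
+          "! mpegpsdemux ! mpegvideoparse ! vdpaumpegdec ! vdpausink",
+          &error);
+      if (pipeline == NULL) {
+        g_printerr ("Could not create pipeline: %s\n", error->message);
+        g_error_free (error);
+        return 1;
+      }
+
+      gst_element_set_state (pipeline, GST_STATE_PLAYING);
+
+      /* Wait until an error occurs or the end of the stream */
+      bus = gst_element_get_bus (pipeline);
+      msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
+          GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
+      if (msg != NULL)
+        gst_message_unref (msg);
+
+      gst_object_unref (bus);
+      gst_element_set_state (pipeline, GST_STATE_NULL);
+      gst_object_unref (pipeline);
+
+      return 0;
+    }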
+
+
+UC2 : HW-accelerated video decoding to different hwaccel sink
+
+  Example : * VDPAU/libVA decoder to OpenGL-based sink
+
+  The goal here is to end up with the decoded pictures as OpenGL
+  textures, which can then be used in an OpenGL scene (with all the
+  transformations one can do with those textures).
+
+  GStreamer is responsible for:
+   1) Filling the contents of those textures
+   2) Informing the application when to use which texture at which
+      time (i.e. synchronization).
+
+  How the textures are used is not the responsibility of GStreamer,
+  although a fallback could be possible (displaying the texture in a
+  specified X window, for example) if the application does not handle
+  the OpenGL scene.
+
+  Efficient usage is only possible if the HW-accelerated system
+  provides an API by which one can either:
+  * Be given OpenGL texture IDs for the decoder to decode into
+  * OR 'transform' hwaccel-backed buffers into texture IDs
+
+  Just as for UC1, some information will need to be exchanged between
+  the OpenGL-backed elements and the other HW-accelerated elements.
+
+
+UC3 : HW-accelerated decoding to HW-accelerated encoding
+
+  This is needed in cases where we want to re-encode a stream from one
+  format/profile to another format/profile, for example for UPnP/DLNA
+  embedded devices.
+
+  If the encoder and decoder are using the same backing hardware, this
+  is similar to UC1.
+
+  If the encoder and decoder are backed by 1) different hardware but
+  there is an API allowing communication between the two, OR 2) the
+  same hardware but through different APIs, this is similar to UC2.
+
+  If the hardware backing the encoder and the decoder has no direct
+  means of communication, then a best effort must be made to introduce
+  only one copy. The recent ongoing improvements in the kernel
+  regarding DMA usage could help in that regard, by allowing one piece
+  of hardware to be aware of another.
+
+
+UC4 : HW-accelerated decoding to software plugin
+
+  Examples : * Transcoding a stream using a software encoder
+             * Applying measurements/transformations
+             * Your crazy idea here
+             * ...
+
+  While the most common usage of HW-accelerated decoding is for
+  display, we do not want to limit users of the GStreamer framework to
+  only being able to use those plugins in a few limited use-cases.
+  Users should be able to benefit from the acceleration in any
+  use-case.
+
+
+UC5 : Software element to HW-accelerated display
+
+  Examples : * Software decoder to VA/VDPAU/GL/... sink
+             * Visualization to VA/VDPAU/GL/... sink
+             * anything, in fact
+
+  We need to ensure in these cases that any GStreamer plugin can
+  output data to a HW-accelerated display.
+
+  This process must not introduce any unwanted synchronization issues,
+  meaning the transfer to the backing hardware needs to happen before
+  the synchronization time in the sinks.
+
+
+UC6 : HW-accelerated capture to HW-accelerated encoder
+
+  Examples : * Camerabin usage
+             * Streaming server
+             * Video-over-IP
+             * ...
+
+  In order to provide not only low CPU usage (through HW-accelerated
+  encoding) but also low latency, we need to be able to have the
+  capture hardware provide the data to be encoded in such a way that
+  the encoder can read it without any copy.
+
+  Some capture APIs provide means by which the hardware can be
+  provided with a pool of buffers backed by contiguous MMAP memory
+  (see the sketch after the use-cases below).
+
+
+UC6.1 : UC6 + simultaneous preview
+
+  Examples : Camerabin usage (preview of video/photo while shooting)
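+
+
+  The following is a minimal, illustrative sketch of how a capture
+  element could discover and use a buffer pool proposed by downstream
+  (e.g. a hwaccel encoder), so that captured frames land directly in
+  memory the encoder can consume without a copy. It assumes the
+  GStreamer 1.0 ALLOCATION query and GstBufferPool API; function and
+  variable names are only for illustration:
+
+    #include <gst/gst.h>
+    #include <gst/video/video.h>
+
+    /* Negotiate a buffer pool with downstream. Called by a capture
+     * element once 'caps' have been decided. */
+    static GstBufferPool *
+    setup_capture_pool (GstPad * srcpad, GstCaps * caps, guint size)
+    {
+      GstBufferPool *pool = NULL;
+      GstStructure *config;
+      GstQuery *query;
+      guint min = 2, max = 0;
+
+      /* Ask downstream whether it can provide a pool of buffers (for
+       * example backed by memory the encoder reads directly). */
+      query = gst_query_new_allocation (caps, TRUE);
+      if (gst_pad_peer_query (srcpad, query) &&
+          gst_query_get_n_allocation_pools (query) > 0)
+        gst_query_parse_nth_allocation_pool (query, 0, &pool, &size,
+            &min, &max);
+      gst_query_unref (query);
+
+      /* Fallback : a plain system-memory video pool */
+      if (pool == NULL)
+        pool = gst_video_buffer_pool_new ();
+
+      config = gst_buffer_pool_get_config (pool);
+      gst_buffer_pool_config_set_params (config, caps, size, min, max);
+      gst_buffer_pool_set_config (pool, config);
+      gst_buffer_pool_set_active (pool, TRUE);
+
+      return pool;
+    }
+
+  In the capture loop, buffers are then acquired from that pool with
+  gst_buffer_pool_acquire_buffer(), filled by the hardware and pushed
+  downstream. The same mechanism would let a hwaccel sink propose its
+  pool to a software upstream element (UC5).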
+
+
+
+Problems:
+---------
+
+P1 : Ranking of decoders
+
+  How do we pick the best decoder available? Do we just give
+  hardware-accelerated plugins a higher rank?
+
+
+P2 : Capabilities of HW-accelerated decoders
+
+  Hardware decoders can have much tighter constraints as to what they
+  can handle (limitations in sizes, bitrate, profile, level, ...).
+
+  These limitations might be known without probing the hardware, but
+  in most cases they require querying it.
+  Getting as much information as possible about the stream to decode
+  is needed. This can be obtained through parsers, only looking for a
+  decoder once the parser has provided extensive caps.
+
+
+P3 : Finding and auto-plugging the best elements
+
+  Taking the case where several decoders are available and several
+  sink elements are available, how do we establish which is the best
+  combination?
+
+  Assuming we take the highest-ranked (and compatible) decoder, how do
+  we figure out which sink element is compatible?
+
+  Assuming the user/application selects a specific sink, how do we
+  figure out which is the best decoder to use?
+
+  /!\ Caps are no longer sufficient to establish compatibility
+
+
+P4 : How to handle systems that require calls to happen in one thread
+
+  In OpenGL (for example), calls can only be made from one thread,
+  which might not be a GStreamer thread (the sink could be controlled
+  from an application thread).
+
+  How do we properly (and safely) handle buffers and contexts? Do we
+  create an API that allows marshalling processing into the proper
+  thread (resulting in an asynchronous API from the GStreamer point of
+  view)?
+
+
+
+Proposed Design:
+----------------
+
+D1 : GstCaps
+
+  We use the "video/x-raw" GstCaps.
+
+  The format field and other required fields are filled in the same
+  way they would be for non-HW-accelerated streams.
+
+
+D2 : Buffers and memory access
+
+  The buffers used/provided/consumed by the various HW-accelerated
+  elements must be usable with non-HW-accelerated elements.
+
+  To that end, the GstMemory backing the various buffers must be
+  accessible via the mapping methods and therefore have a proper
+  GstAllocator implementation if so required.
+
+  In the unlikely case that the hardware does not provide any means to
+  map the memory, or that there are limitations to doing so (such as
+  on DRM systems), there should still be an implementation of
+  GstMemoryMapFunction that returns NULL (and a size/maxsize of zero)
+  when called.
+
+
+D3 : GstVideoMeta
+
+  In the same way that a custom GstAllocator is required, it is
+  important that elements implement the proper GstVideoMeta API
+  wherever applicable.
+
+  The GstVideoMeta fields should correspond to the memory returned by
+  a call to gst_buffer_map() and/or gst_video_meta_map().
+
+  => gst_video_meta_{map|unmap}() needs to call the
+     GstVideoMeta->{map|unmap} implementations
+
+
+D4 : Custom GstMeta
+
+  In order to pass along API- and/or hardware-specific information
+  regarding the various buffers, the elements will be able to create
+  custom GstMeta.
+
+  Ex (For VDPAU):
+
+    struct _GstVDPAUMeta {
+      GstMeta meta;
+
+      VdpDevice device;
+      VdpVideoSurface surface;
+      ...
+    };
+
+  If an element supports multiple APIs for accessing/using the data
+  (like for example VDPAU and GLX), it should add all the applicable
+  GstMeta.
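+
+  A minimal sketch of how such a custom meta could be registered and
+  attached, assuming the GstMeta registration API of GStreamer 1.0
+  (gst_meta_api_type_register(), gst_meta_register(),
+  gst_buffer_add_meta()); the GstVDPAUMeta type is the illustrative
+  one from above:
+
+    #include <gst/gst.h>
+    #include <vdpau/vdpau.h>
+
+    typedef struct _GstVDPAUMeta {
+      GstMeta meta;
+
+      VdpDevice device;
+      VdpVideoSurface surface;
+    } GstVDPAUMeta;
+
+    static gboolean
+    gst_vdpau_meta_init (GstMeta * meta, gpointer params,
+        GstBuffer * buffer)
+    {
+      GstVDPAUMeta *vmeta = (GstVDPAUMeta *) meta;
+
+      vmeta->device = VDP_INVALID_HANDLE;
+      vmeta->surface = VDP_INVALID_HANDLE;
+      return TRUE;
+    }
+
+    /* Registration should only happen once (e.g. from plugin init) */
+    static const GstMetaInfo *
+    gst_vdpau_meta_get_info (void)
+    {
+      static const GstMetaInfo *info = NULL;
+
+      if (info == NULL) {
+        static const gchar *tags[] = { NULL };
+        GType api = gst_meta_api_type_register ("GstVDPAUMetaAPI", tags);
+
+        info = gst_meta_register (api, "GstVDPAUMeta",
+            sizeof (GstVDPAUMeta), gst_vdpau_meta_init, NULL, NULL);
+      }
+      return info;
+    }
+
+    /* Called by the decoder once a picture has been decoded into
+     * 'surface' on 'device'. */
+    static void
+    gst_buffer_add_vdpau_meta (GstBuffer * buffer, VdpDevice device,
+        VdpVideoSurface surface)
+    {
+      GstVDPAUMeta *vmeta = (GstVDPAUMeta *)
+          gst_buffer_add_meta (buffer, gst_vdpau_meta_get_info (), NULL);
+
+      vmeta->device = device;
+      vmeta->surface = surface;
+    }
+
+  Elements that understand VDPAU can then look this meta up on
+  incoming buffers; other elements simply ignore it and fall back to
+  mapping the memory as described in D2.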
+
+
+D5 : Buffer pools
+
+  In order to:
+  * avoid expensive cycles of buffer destruction/creation,
+  * allow upstream elements to end up with the optimal buffers/memory
+    to upload to,
+  elements should implement GstBufferPools whenever possible.
+
+  If the backing hardware has a system by which it differentiates
+  between used buffers and available buffers, the bufferpool should
+  have the proper release_buffer() and acquire_buffer()
+  implementations.
+
+
+D6 : Ahead-of-time/asynchronous uploading
+
+  In the case where the buffers to be displayed are not on the target
+  hardware, we need to ensure the buffers are uploaded before the
+  synchronization time. If data is uploaded at render time, we will
+  end up with an unknown render latency, resulting in bad A/V
+  synchronization.
+
+  In order for this to happen, the buffers provided by downstream
+  elements should have a GstAllocator implementation allowing
+  uploading memory on _map(GST_MAP_WRITE).
+
+  If this uploading happens asynchronously, the GstAllocator should
+  implement a system so that if an intermediary element wishes to map
+  the memory it can do so (either by providing a cached version of the
+  memory, or by using locks).
+
+
+D7 : Overlay and positioning support
+
+  FIXME : Move to a separate design doc
+
+    struct _GstVideoCompositingMeta {
+      GstMeta meta;
+
+      /* zorder : Depth position of the layer in the final scene
+       * 0         = background
+       * 2**32 - 1 = foreground
+       */
+      guint zorder;
+
+      /* x,y : Spatial position of the layer in the final scene
+       */
+      guint x;
+      guint y;
+
+      /* width/height : Target width/height of the layer in the
+       * final scene.
+       */
+      guint width;
+      guint height;
+
+      /* basewidth/baseheight : Reference scene width/height
+       * If both values are zero, the x/y/width/height values above
+       * are to be used as absolute coordinates, regardless of the
+       * final scene's width and height.
+       * If the values are non-zero, the x/y/width/height values
+       * above should be scaled based on those values.
+       * Ex : real x position = x / basewidth * scene_width
+       */
+      guint basewidth;
+      guint baseheight;
+
+      /* alpha : Global alpha multiplier
+       * 0.0 = completely transparent
+       * 1.0 = no modification of the original transparency (or
+       *       opacity)
+       */
+      gdouble alpha;
+    };
+
+
+D8 : De-interlacing support
+
+  FIXME : Move to a separate design doc
+
+  For systems that can apply deinterlacing, the user needs to be in
+  control of whether it should be applied or not.
+
+  This should be done through the usage of the deinterlace element.
+
+  In order to benefit from the HW-acceleration, downstream/upstream
+  elements need a way by which they can indicate that the
+  deinterlacing process will be applied later.
+
+  To this end, we introduce a new GstMeta : GstDeinterlaceMeta
+
+    typedef const gchar *GstDeinterlaceMethod;
+
+    struct _GstDeinterlaceMeta {
+      GstMeta meta;
+
+      GstDeinterlaceMethod method;
+    };
+
+
+D9 : Context sharing
+
+  Re-use parts of -bad's videocontext?
+
+
+D10 : Non-MT-safe APIs
+
+  If the wrapped API/system does not offer an API which is MT-safe
+  and/or usable from more than one thread (like OpenGL), we need:
+  * A system by which a global context can be provided to all elements
+    wanting to use that system,
+  * A system by which elements can serialize processing to a 3rd-party
+    thread (a sketch is given at the end of this document).
+
+
+[0]: Defining "noticeable overhead" is always tricky, but it
+essentially means that the overhead introduced by GStreamer core and
+the element code should not exceed the overhead introduced for
+non-hw-accelerated elements.
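+
+
+Appendix : Sketch for D10 (serializing to a context thread)
+
+  The following is a minimal sketch of the second point of D10,
+  assuming GLib >= 2.32 threading primitives (GAsyncQueue, GMutex,
+  GCond); all names are illustrative. A single thread owns the
+  non-MT-safe context (e.g. the OpenGL context); other threads push
+  work items to it and wait for completion:
+
+    #include <gst/gst.h>
+
+    /* A unit of work to be executed in the thread owning the context */
+    typedef struct {
+      void (*func) (gpointer data);
+      gpointer data;
+      GMutex lock;
+      GCond cond;
+      gboolean done;
+    } MarshalJob;
+
+    /* Created when the context is set up */
+    static GAsyncQueue *context_queue;
+
+    /* Body of the single thread owning the non-MT-safe context,
+     * started with g_thread_new() when the context is created. */
+    static gpointer
+    context_thread_func (gpointer user_data)
+    {
+      while (TRUE) {
+        MarshalJob *job = g_async_queue_pop (context_queue);
+
+        /* Executed with the context current in this thread */
+        job->func (job->data);
+
+        g_mutex_lock (&job->lock);
+        job->done = TRUE;
+        g_cond_signal (&job->cond);
+        g_mutex_unlock (&job->lock);
+      }
+      return NULL;
+    }
+
+    /* Called from any (streaming) thread; blocks until 'func' has
+     * run in the context thread. */
+    static void
+    context_thread_run_sync (void (*func) (gpointer data), gpointer data)
+    {
+      MarshalJob job;
+
+      job.func = func;
+      job.data = data;
+      job.done = FALSE;
+      g_mutex_init (&job.lock);
+      g_cond_init (&job.cond);
+
+      g_async_queue_push (context_queue, &job);
+
+      g_mutex_lock (&job.lock);
+      while (!job.done)
+        g_cond_wait (&job.cond, &job.lock);
+      g_mutex_unlock (&job.lock);
+
+      g_cond_clear (&job.cond);
+      g_mutex_clear (&job.lock);
+    }
+
+  From the GStreamer point of view this results in the asynchronous
+  behaviour hinted at in P4; a non-blocking variant would return
+  immediately and signal completion on the buffer/memory instead.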