Hardware Acceleration in GStreamer 1.0
--------------------------------------

Status : DRAFT

Preamble:

  This document serves to identify and define the various usages of
  hardware acceleration (hereafter hwaccel) in GStreamer 1.0, the
  problems that arise and need to be solved, and a proposed API.

Out of scope:

  This document will initially limit itself to the usage of hwaccel in
  the fields of video capture, processing and display due to their
  complexity. It is not excluded that some parts of this research
  could be applicable to other fields (audio, text, generic media).

  This document will not cover how encoded data is parsed and
  fed/obtained to/from the various hardware subsystems.

Overall Goal:

  Make the most of the underlying hardware features while at the same
  time not introducing any noticeable overhead [0], and provide the
  biggest flexibility of use-cases possible.

Secondary Goals:

  * Avoid providing a system that only allows (efficient) usage of one
    use-case and/or through a specific combination of elements. This
    is contrary to the principles of GStreamer.
  * Not introduce any unneeded memory copies.
  * Not introduce any extra latency.
  * Process data asynchronously wherever possible.

Terminology:

  Due to the limitations of the GStreamer 0.10 API, most of these
  elements, especially sink elements, were named "non-raw video
  elements". In the rest of this document, we will no longer refer to
  them as non-raw since they _do_ handle raw video, and in GStreamer
  1.0 it no longer matters where the raw video is located or accessed.
  We will prefer the term "hardware-accelerated video element".

Specificities:

  Hardware-accelerated elements differ from non-hwaccel elements in a
  few ways:

  * They handle memory which, in the vast majority of cases, is not
    directly accessible.
  * The processing _can_ happen asynchronously.
  * They _might_ be part of a GPU sub-system and therefore be tightly
    coupled to the display system.
Features handled:

  HW-accelerated elements can handle a variety of individual logical
  features. These should, in the spirit of GStreamer, be controllable
  in an individual fashion.

  * Video decoding and encoding
  * Display
  * Capture
  * Scaling (Downscaling (preview), Upscaling (Super-resolution))
  * Deinterlacing (including inverse telecine)
  * Post-processing (Noise reduction, ...)
  * Colorspace conversion
  * Overlaying and compositing

Use-cases:
----------

UC1 : HW-accelerated video decoding to counterpart sink

  Examples :
  * VDPAU decoder to VDPAU sink
  * libVA decoder to libVA sink

  In these situations, the HW-accelerated decoder and sink can use the
  same API to communicate with each other and share data.

  There might be extra processing that can be applied before display
  (deinterlacing, noise reduction, overlaying, ...) and that is
  provided by the backing hardware. All these features should be
  usable in a transparent fashion from GStreamer.

  They might also need to communicate/share a common context.

UC2 : HW-accelerated video decoding to a different hwaccel sink

  Example :
  * VDPAU/libVA decoder to OpenGL-based sink

  The goal here is to end up with the decoded pictures as OpenGL
  textures, which can then be used in an OpenGL scene (with all the
  transformations one can do with those textures).

  GStreamer is responsible for:
  1) Filling the contents of those textures
  2) Informing the application when to use which texture at which time
     (i.e. synchronization)

  How the textures are used is not the responsibility of GStreamer,
  although a fallback could be possible (displaying the texture in a
  specified X window, for example) if the application does not handle
  the OpenGL scene.
  Efficient usage is only possible if the HW-accelerated system
  provides an API by which one can either:
  * Be given OpenGL texture IDs for the decoder to decode into
  * OR 'transform' hwaccel-backed buffers into texture IDs

  Just as for UC1, some information will need to be exchanged between
  the OpenGL-backed elements and the other HW-accelerated element.

UC3 : HW-accelerated decoding to HW-accelerated encoding

  This is needed in cases where we want to re-encode a stream from one
  format/profile to another format/profile, for example for UPnP/DLNA
  embedded devices.

  If the encoder and decoder are using the same backing hardware, this
  is similar to UC1.

  If the encoder and decoder are backed by
  1) different hardware, but there is an API allowing communication
     between the two, OR
  2) the same hardware, but through different APIs,
  this is similar to UC2.

  If the hardware backing the encoder and decoder have no direct means
  of communication, then a best effort must be made to introduce only
  one copy. The recent ongoing improvements in the kernel regarding
  DMA usage could help in that regard, allowing some hardware to be
  aware of other hardware.

UC4 : HW-accelerated decoding to software plugin

  Examples :
  * Transcoding a stream using a software encoder
  * Applying measurements/transformations
  * Your crazy idea here
  * ...

  While the most common usage of HW-accelerated decoding is for
  display, we do not want to limit users of the GStreamer framework to
  only being able to use those plugins in some limited use-cases.
  Users should be able to benefit from the acceleration in any
  use-case.

UC5 : Software element to HW-accelerated display

  Examples :
  * Software decoder to VA/VDPAU/GL/... sink
  * Visualization to VA/VDPAU/GL/... sink
  * anything, in fact

  We need to ensure in these cases that any GStreamer plugin can
  output data to a HW-accelerated display.
  This process must not introduce any unwanted synchronization issues,
  meaning the transfer to the backing hardware needs to happen before
  the synchronization time in the sinks.

UC6 : HW-accelerated capture to HW-accelerated encoder

  Examples :
  * Camerabin usage
  * Streaming server
  * Video-over-IP
  * ...

  In order to provide not only low CPU usage (through HW-accelerated
  encoding) but also low latency, we need to be able to have the
  capture hardware provide the data to be encoded in such a way that
  the encoder can read it without any copy.

  Some capture APIs provide means by which the hardware can provide a
  pool of buffers backed by mmap-ed contiguous memory.

UC6.1 : UC6 + simultaneous preview

  Examples :
  * Camerabin usage (preview of video/photo while shooting)

Problems:
---------

P1 : Ranking of decoders

  How do we pick the best decoder available ? Do we just set the
  ranking of hardware-accelerated plugins to higher ranks ?

P2 : Capabilities of HW-accelerated decoders

  Hardware decoders can have much tighter constraints as to what they
  can handle (limitations in sizes, bitrate, profile, level, ...).
  These limitations might be known without probing the hardware, but
  in most cases they require querying it.

  Getting as much information as possible about the stream to decode
  is needed. This can be obtained through parsers, only looking for a
  decoder once the parser has provided extensive caps.

P3 : Finding and auto-plugging the best elements

  Taking the case where several decoders and several sink elements are
  available, how do we establish which is the best combination ?

  Assuming we take the highest-ranked (and compatible) decoder, how do
  we figure out which sink element is compatible ?

  Assuming the user/application selects a specific sink, how do we
  figure out which is the best decoder to use ?
  /!\ Caps are no longer sufficient to establish compatibility.

P4 : How to handle systems that require calls to happen in one thread

  In OpenGL (for example), calls can only be done from one thread,
  which might not be a GStreamer thread (the sink could be controlled
  from an application thread).

  How do we properly (and safely) handle buffers and contexts ? Do we
  create an API that allows marshalling processing into the proper
  thread (resulting in an asynchronous API from the GStreamer point of
  view) ?

Proposed Design:
----------------

D1 : GstCaps

  We use the "video/x-raw" GstCaps. The format field and other
  required fields are filled in the same way they would be for
  non-HW-accelerated streams.

D2 : Buffers and memory access

  The buffers used/provided/consumed by the various HW-accelerated
  elements must be usable with non-HW-accelerated elements. To that
  extent, the GstMemory backing the various buffers must be accessible
  via the mapping methods, and therefore have the proper GstAllocator
  implementation if so required.

  In the unlikely case that the hardware does not provide any means to
  map the memory, or that there are limitations to doing so (such as
  on DRM systems), there should still be an implementation of
  GstMemoryMapFunction that returns NULL (and a size/maxsize of zero)
  when called.

D3 : GstVideoMeta

  In the same way that a custom GstAllocator is required, it is
  important that elements implement the proper GstVideoMeta API
  wherever applicable.

  The GstVideoMeta fields should correspond to the memory returned by
  a call to gst_buffer_map() and/or gst_video_meta_map().

  => gst_video_meta_{map|unmap}() needs to call the
     GstVideoMeta->{map|unmap} implementations.

D4 : Custom GstMeta

  In order to pass along API- and/or hardware-specific information
  regarding the various buffers, the elements will be able to create
  custom GstMeta.

  Ex (for VDPAU):

    struct _GstVDPAUMeta {
      GstMeta meta;

      VdpDevice device;
      VdpVideoSurface surface;
      ...
    };

  If an element supports multiple APIs for accessing/using the data
  (like, for example, VDPAU and GLX), it should attach all the
  applicable GstMeta.

D5 : Buffer pools

  In order to:
  * avoid expensive cycles of buffer destruction/creation,
  * allow upstream elements to end up with the optimal buffers/memory
    to which to upload,
  elements should implement GstBufferPools whenever possible.

  If the backing hardware has a system by which it differentiates used
  buffers from available buffers, the bufferpool should have the
  proper release_buffer() and acquire_buffer() implementations.

D6 : Ahead-of-time/asynchronous uploading

  In the case where the buffers to be displayed are not on the target
  hardware, we need to ensure the buffers are uploaded before the
  synchronization time. If data is uploaded at render time, we will
  end up with an unknown render latency, resulting in bad A/V
  synchronization.

  In order for this to happen, the buffers provided by downstream
  elements should have a GstAllocator implementation allowing
  uploading memory on _map(GST_MAP_WRITE).

  If this uploading happens asynchronously, the GstAllocator should
  implement a system so that if an intermediary element wishes to map
  the memory it can do so (either by providing a cached version of the
  memory, or by using locks).

D7 : Overlay and positioning support

  FIXME : Move to a separate design doc

    struct _GstVideoCompositingMeta {
      GstMeta meta;

      /* zorder : Depth position of the layer in the final scene
       * 0 = background
       * 2**32 - 1 = foreground */
      guint zorder;

      /* x,y : Spatial position of the layer in the final scene */
      guint x;
      guint y;

      /* width/height : Target width/height of the layer in the
       * final scene. */
      guint width;
      guint height;

      /* basewidth/baseheight : Reference scene width/height
       * If both values are zero, the x/y/width/height values above
       * are to be used as absolute coordinates, regardless of the
       * final scene's width and height.
       * If the values are non-zero, the x/y/width/height values
       * above should be scaled based on those values.
       * Ex : real x position = x / basewidth * scene_width */
      guint basewidth;
      guint baseheight;

      /* alpha : Global alpha multiplier
       * 0.0 = completely transparent
       * 1.0 = no modification of original transparency (or opacity) */
      gdouble alpha;
    };

D8 : De-interlacing support

  FIXME : Move to a separate design doc

  For systems that can apply deinterlacing, the user needs to be in
  control of whether it should be applied or not. This should be done
  through the usage of the deinterlace element.

  In order to benefit from the HW acceleration, downstream/upstream
  elements need a way by which they can indicate that the
  deinterlacing process will be applied later. To this extent, we
  introduce a new GstMeta : GstDeinterlaceMeta.

    typedef const gchar *GstDeinterlaceMethod;

    struct _GstDeinterlaceMeta {
      GstMeta meta;

      GstDeinterlaceMethod method;
    };

D9 : Context sharing

  Re-use parts of -bad's videocontext ?

D10 : Non-MT-safe APIs

  If the wrapped API/system does not offer an API which is MT-safe
  and/or usable from more than one thread (like OpenGL), we need:
  * A system by which a global context can be provided to all elements
    wanting to use that system,
  * A system by which elements can serialize processing to a 3rd-party
    thread.

[0]: Defining "noticeable overhead" is always tricky, but it
     essentially means that the overhead introduced by GStreamer core
     and the element code should not exceed the overhead introduced
     for non-HW-accelerated elements.