Hardware Acceleration in GStreamer 1.0
--------------------------------------
Status : DRAFT
Preamble:
This document serves to identify and define the various usages of
hardware-acceleration (hereafter hwaccel) in GStreamer 1.0, the
problems that arise and need to be solved, and a proposed API.
Out of scope:
This document will initially limit itself to usage of hwaccel in the
field of video capture, processing and display due to their
complexity.
It is not excluded that some parts of the research could be
applicable to other fields (audio, text, generic media).
This document will not cover how encoded data is parsed and
fed to, or obtained from, the various hardware subsystems.
Overall Goal:
Make the most of the underlying hardware features while at the same
time not introducing any noticeable overhead [0] and providing the
greatest possible flexibility of use-cases.
Secondary Goals:
Avoid providing a system that only allows (efficient) usage of one
use-case and/or only through a specific combination of elements. This
is contrary to the principles of GStreamer.
Do not introduce any unneeded memory copies.
Do not introduce any extra latency.
Process data asynchronously wherever possible.
Terminology:
Due to the limitations of the GStreamer 0.10 API, most
hardware-accelerated elements, especially sink elements, were named
"non-raw video elements".
In the rest of this document, we will no longer refer to them as
non-raw since they _do_ handle raw video and in GStreamer 1.0 it no
longer matters where the raw video is located or accessed. We will
prefer the term "hardware-accelerated video element".
Specificities:
Hardware-accelerated elements differ from non-hwaccel elements in a
few ways:
* They handle memory which, in the vast majority of cases, is
not directly accessible.
* The processing _can_ happen asynchronously.
* They _might_ be part of a GPU sub-system and therefore tightly
coupled to the display system.
Features handled:
HW-accelerated elements can handle a variety of individual logical
features. These should, in the spirit of GStreamer, be controllable
in an individual fashion.
* Video decoding and encoding
* Display
* Capture
* Scaling (Downscaling (preview), Upscaling (Super-resolution))
* Deinterlacing (including inverse-telecine)
* Post-processing (Noise reduction, ...)
* Colorspace conversion
* Overlaying and compositing
Use-cases:
----------
UC1 : HW-accelerated video decoding to counterpart sink
Example : * VDPAU decoder to VDPAU sink
          * libVA decoder to libVA sink
In these situations, the HW-accelerated decoder and sink can use the
same API to communicate with each other and share data.
There might be extra processing that can be applied before display
(deinterlacing, noise reduction, overlaying, ...) and that is
provided by the backing hardware. All these features should be
usable in a transparent fashion from GStreamer.
They might also need to communicate/share a common context.
UC2 : HW-accelerated video decoding to different hwaccel sink
Example : * VDPAU/libVA decoder to OpenGL-based sink
The goal here is to end up with the decoded pictures as OpenGL
textures, which can then be used in an OpenGL scene (with all the
transformations one can do with those textures).
GStreamer is responsible for:
1) Filling the contents of those textures
2) Informing the application when to use which texture at which time
(i.e. synchronization).
How the textures are used is not the responsibility of GStreamer,
although a fallback could be possible (displaying the texture in a
specified X window, for example) if the application does not handle
the OpenGL scene.
Efficient usage is only possible if the HW-accelerated system
provides an API by which one can either:
* Be given OpenGL texture IDs for the decoder to decode into
* OR 'transform' hwaccel-backed buffers into texture IDs
Just as for UC1, some information will need to be exchanged between
the OpenGL-backed elements and the other HW-accelerated element.
UC3 : HW-accelerated decoding to HW-accelerated encoding
This is needed in cases where we want to re-encode a stream from one
format/profile to another format/profile, for example on UPnP/DLNA
embedded devices.
If the encoder and decoder are using the same backing hardware, this
is similar to UC1.
If the encoder and decoder are backed by 1) different hardware but
there is an API allowing communication between the two, OR 2) the
same hardware but through different APIs, this is similar to UC2.
If the hardware backing the encoder and the decoder does not have a
direct means of communication, then best effort must be made to
introduce only one copy. The recent ongoing improvements in the
kernel regarding DMA usage could help in that regard, allowing one
piece of hardware to be aware of another.
UC4 : HW-accelerated decoding to software plugin
Examples : * Transcoding a stream using a software encoder
           * Applying measurement/transformations
           * Your crazy idea here
           * ...
While the most common usage of HW-accelerated decoding is for
display, we do not want to limit users of the GStreamer framework to
only be able to use those plugins in some limited use-cases. Users
should be able to benefit from the acceleration in any use-case.
UC5 : Software element to HW-accelerated display
Examples : * Software decoder to VA/VDPAU/GL/... sink
           * Visualization to VA/VDPAU/GL/... sink
           * anything in fact
We need to ensure in these cases that any GStreamer plugin can
output data to a HW-accelerated display.
This process must not introduce any unwanted synchronization issues,
meaning the transfer to the backing hardware needs to happen before
the synchronization time in the sinks.
UC6 : HW-accelerated capture to HW-accelerated encoder
Examples : * Camerabin usage
           * Streaming server
           * Video-over-IP
           * ...
In order to provide not only low CPU usage (through HW-accelerated
encoding) but also low latency, we need the capture hardware to
provide the data to be encoded in such a way that the encoder can
read it without any copy.
Some capture APIs offer means by which the hardware can be provided
with a pool of buffers backed by contiguous mmap'ed memory.
UC6.1 : UC6 + simultaneous preview
Examples : Camerabin usage (preview of video/photo while shooting)
Problems:
---------
P1 : Ranking of decoders
How do we pick the best decoder available ? Do we just set the
ranking of hardware-accelerated plugins to higher ranks ?
P2 : Capabilities of HW-accelerated decoders
Hardware decoders can have much tighter constraints as to what they
can handle (limitations in sizes, bitrate, profile, level,
...).
These limitations might be known without probing the hardware, but
in most cases they require querying it.
As much information as possible about the stream to decode is
needed. It can be obtained through parsers: a decoder should only be
looked for once the parser has provided extensive caps.
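
As an illustration of P2, here is a minimal sketch in plain C of the
check a decoder could perform once the parser has provided extensive
caps. The type and function names (StreamInfo, DecoderLimits,
decoder_can_handle) and the chosen set of limits are assumptions made
for this example only; they are not GStreamer API.

```c
#include <assert.h>
#include <stddef.h>

/* Stream properties as a parser might expose them through caps.
 * Hypothetical type, for illustration only. */
typedef struct {
  unsigned int width;
  unsigned int height;
  unsigned int level;    /* assumed encoding: 41 means level 4.1 */
  unsigned int bitrate;  /* bits per second */
} StreamInfo;

/* Limits of a hardware decoder, typically obtained by querying it.
 * Hypothetical type, for illustration only. */
typedef struct {
  unsigned int max_width;
  unsigned int max_height;
  unsigned int max_level;
  unsigned int max_bitrate;
} DecoderLimits;

/* Returns 1 if the stream fits within every limit, 0 otherwise. */
static int
decoder_can_handle (const DecoderLimits *limits, const StreamInfo *stream)
{
  assert (limits != NULL && stream != NULL);

  return stream->width <= limits->max_width
      && stream->height <= limits->max_height
      && stream->level <= limits->max_level
      && stream->bitrate <= limits->max_bitrate;
}
```

The less detail the parser provides, the less useful such a check is,
which is why extensive caps are wanted before a decoder is selected.
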
P3 : Finding and auto-plugging the best elements
Taking the case where several decoders are available and several
sink elements are available, how do we establish which is the best
combination ?
Assuming we take the highest-ranked (and compatible) decoder, how do
we figure out which sink element is compatible ?
Assuming the user/application selects a specific sink, how do we
figure out which is the best decoder to use ?
/!\ Caps are no longer sufficient to establish compatibility
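
To make the problem concrete, here is a sketch in plain C of one
possible selection rule: take the highest-ranked decoder that shares
at least one memory API with the chosen sink. This is not part of the
proposal. The Candidate type, the API_* flags, and the idea of
describing compatibility as a bitmask are all assumptions made for
this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags naming the ways an element can exchange data. */
enum {
  API_SYSMEM = 1 << 0,
  API_VDPAU  = 1 << 1,
  API_VAAPI  = 1 << 2,
  API_GL     = 1 << 3
};

/* Hypothetical description of a candidate decoder. */
typedef struct {
  const char *name;
  unsigned int rank;  /* higher is preferred */
  unsigned int apis;  /* bitmask of API_* flags */
} Candidate;

/* Returns the highest-ranked decoder sharing an API with the sink,
 * or NULL if none is compatible. On equal rank the first one wins. */
static const Candidate *
pick_decoder (const Candidate *decoders, size_t n, unsigned int sink_apis)
{
  const Candidate *best = NULL;
  size_t i;

  assert (decoders != NULL || n == 0);

  for (i = 0; i < n; i++) {
    if ((decoders[i].apis & sink_apis) == 0)
      continue;                 /* no common way to share data */
    if (best == NULL || decoders[i].rank > best->rank)
      best = &decoders[i];
  }
  return best;
}
```

Note that the caps play no role in this choice, and that the
highest-ranked decoder overall is skipped when it has nothing in
common with the sink.
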
P4 : How to handle systems that require calls to happen in one thread
In OpenGL (for example) calls can only be done from one thread,
which might not be a GStreamer thread (the sink could be controlled
from an application thread).
How do we properly (and safely) handle buffers and contexts ? Do we
create an API that allows marshalling processing into the proper
thread (resulting in an asynchronous API from the GStreamer point of
view) ?
Proposed Design:
D1 : GstCaps
We use the "video/x-raw" GstCaps.
The format field and other required fields are filled in the same
way they would be for non-HW-accelerated streams.
D2 : Buffers and memory access
The buffers used/provided/consumed by the various HW-accelerated
elements must be usable with non-HW-accelerated elements.
To that end, the GstMemory backing the various buffers must be
accessible via the mapping methods and therefore have the proper
GstAllocator implementation if so required.
In the unlikely event that the hardware does not provide any means to
map the memory, or that mapping is restricted (such as on DRM
systems), there should still be an implementation of
GstMemoryMapFunction that returns NULL (and a size/maxsize of zero)
when called.
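
The contract described above can be modelled in plain C as follows.
This is only a sketch: HwMemory, HwMapFunc and the functions below are
hypothetical stand-ins, not the GstMemory or GstAllocator API. It
shows a map function that always exists, and a consumer that checks
the result instead of assuming the memory is CPU accessible.

```c
#include <assert.h>
#include <stddef.h>

typedef struct HwMemory HwMemory;

/* Hypothetical stand-in for a map function: returns the CPU-visible
 * address and stores the mapped size, or returns NULL with size 0. */
typedef void *(*HwMapFunc) (HwMemory *mem, size_t *size);

struct HwMemory {
  HwMapFunc map;
  void *cpu_data;
  size_t size;
};

/* Map function for memory the CPU can reach directly. */
static void *
map_cpu_visible (HwMemory *mem, size_t *size)
{
  *size = mem->size;
  return mem->cpu_data;
}

/* Map function for memory that cannot be mapped (e.g. protected). */
static void *
map_protected (HwMemory *mem, size_t *size)
{
  (void) mem;
  *size = 0;
  return NULL;
}

/* Consumer side: returns 1 and fills data/size when the memory is
 * mappable, 0 otherwise so that the caller can take another path. */
static int
hw_memory_try_map (HwMemory *mem, void **data, size_t *size)
{
  assert (mem != NULL && mem->map != NULL);
  assert (data != NULL && size != NULL);

  *data = mem->map (mem, size);
  return *data != NULL;
}
```

A non-HW-accelerated element receiving such memory can then fail
cleanly or renegotiate rather than dereference an invalid pointer.
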
D3 : GstVideoMeta
In the same way that a custom GstAllocator is required, it is
important that elements implement the proper GstVideoMeta API
wherever applicable.
The GstVideoMeta fields should correspond to the memory returned by
a call to gst_buffer_map() and/or gst_video_meta_map().
=> gst_video_meta_{map|unmap}() needs to call the
GstVideoMeta->{map|unmap} implementations
D4 : Custom GstMeta
In order to pass along API and/or hardware-specific information
regarding the various buffers, the elements will be able to create
custom GstMeta.
Ex (For VDPAU):
struct _GstVDPAUMeta {
  GstMeta meta;
  VdpDevice device;
  VdpVideoSurface surface;
  ...
};
If an element supports multiple APIs for accessing/using the data
(for example VDPAU and GLX), it should attach all the applicable
GstMeta.
D5 : Buffer pools
In order to:
* avoid expensive cycles of buffer destruction/creation,
* allow upstream elements to end up with the optimal buffers/memory
to which to upload,
elements should implement GstBufferPools whenever possible.
If the backing hardware has a system by which it differentiates used
buffers and available buffers, the bufferpool should have the proper
release_buffer() and acquire_buffer() implementations.
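
The used/available bookkeeping mentioned above can be sketched in
plain C. This models a fixed set of hardware surfaces and is not the
GstBufferPool API: SurfacePool, PoolBuffer and the surface_pool_*
functions are hypothetical names, and locking is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 3

/* Hypothetical buffer wrapping a handle to a hardware surface. */
typedef struct {
  int surface_id;
  int in_use;
} PoolBuffer;

typedef struct {
  PoolBuffer buffers[POOL_SIZE];
} SurfacePool;

/* Surfaces are created once and then recycled for the pool's life. */
static void
surface_pool_init (SurfacePool *pool)
{
  int i;

  for (i = 0; i < POOL_SIZE; i++) {
    pool->buffers[i].surface_id = i;
    pool->buffers[i].in_use = 0;
  }
}

/* Hands out the first available buffer, or NULL if all are in use. */
static PoolBuffer *
surface_pool_acquire (SurfacePool *pool)
{
  int i;

  for (i = 0; i < POOL_SIZE; i++) {
    if (!pool->buffers[i].in_use) {
      pool->buffers[i].in_use = 1;
      return &pool->buffers[i];
    }
  }
  return NULL;
}

/* Marks a buffer as available again without destroying its surface. */
static void
surface_pool_release (PoolBuffer *buf)
{
  assert (buf != NULL && buf->in_use);
  buf->in_use = 0;
}
```

Returning NULL on exhaustion keeps the sketch short; a real pool
would more likely wait until a buffer is released.
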
D6 : Ahead-of-time/asynchronous uploading
In the case where the buffers to be displayed are not on the target
hardware, we need to ensure the buffers are uploaded before the
synchronization time. If data is uploaded at the render time we will
end up with an unknown render latency, resulting in bad A/V
synchronization.
In order for this to happen, the buffers provided by downstream
elements should have a GstAllocator implementation allowing
uploading memory on _map(GST_MAP_WRITE).
If this uploading happens asynchronously, the GstAllocator should
implement a system so that if an intermediary element wishes to map
the memory it can do so (either by providing a cached version of the
memory, or by using locks).
D7 : Overlay and positioning support
FIXME : Move to a separate design doc
struct _GstVideoCompositingMeta {
  GstMeta meta;
  /* zorder : Depth position of the layer in the final scene
   * 0 = background
   * 2**32 - 1 = foreground
   */
  guint zorder;
  /* x,y : Spatial position of the layer in the final scene
   */
  guint x;
  guint y;
  /* width/height : Target width/height of the layer in the
   * final scene.
   */
  guint width;
  guint height;
  /* basewidth/baseheight : Reference scene width/height
   * If both values are zero, the x/y/width/height values above
   * are to be used as absolute coordinates, regardless of the
   * final scene's width and height.
   * If the values are non-zero, the x/y/width/height values
   * above should be scaled based on those values.
   * Ex : real x position = x / basewidth * scene_width
   */
  guint basewidth;
  guint baseheight;
  /* alpha : Global alpha multiplier
   * 0.0 = completely transparent
   * 1.0 = no modification of original transparency (or opacity)
   */
  gdouble alpha;
};
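
The scaling rule described in the basewidth/baseheight comment can be
sketched in plain C as follows. LayerGeometry, LayerRect and the two
functions are hypothetical names that only mirror the positional
fields of the struct above. Treating each axis independently when
only one of the base values is zero is an assumption, since that case
is not specified. The multiplication is done before the division
because, with unsigned integers, x / basewidth would truncate to zero
whenever x is smaller than basewidth.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical copy of the positional fields of the proposed meta. */
typedef struct {
  unsigned int x, y, width, height;
  unsigned int basewidth, baseheight;
} LayerGeometry;

/* Resolved position and size in the final scene. */
typedef struct {
  unsigned int x, y, width, height;
} LayerRect;

/* Scales one value from the reference scene to the final scene.
 * Multiplies first, in 64 bits, to avoid truncation and overflow. */
static unsigned int
scale_coord (unsigned int value, unsigned int base, unsigned int scene)
{
  assert (base != 0);
  return (unsigned int) (((uint64_t) value * scene) / base);
}

static LayerRect
resolve_layer_rect (const LayerGeometry *g,
    unsigned int scene_width, unsigned int scene_height)
{
  /* Start from the absolute interpretation (base values of zero). */
  LayerRect r = { g->x, g->y, g->width, g->height };

  if (g->basewidth != 0) {
    r.x = scale_coord (g->x, g->basewidth, scene_width);
    r.width = scale_coord (g->width, g->basewidth, scene_width);
  }
  if (g->baseheight != 0) {
    r.y = scale_coord (g->y, g->baseheight, scene_height);
    r.height = scale_coord (g->height, g->baseheight, scene_height);
  }
  return r;
}
```

For instance, a layer at x = 160 with basewidth = 640 ends up at
x = 480 in a scene that is 1920 pixels wide.
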
D8 : De-interlacing support
FIXME : Move to a separate design doc
For systems that can apply deinterlacing, the user needs to be in
control of whether it should be applied or not.
This should be done through the usage of the deinterlace element.
In order to benefit from the HW-acceleration, downstream/upstream
elements need a way by which they can indicate that the
deinterlacing process will be applied later.
To this end, we introduce a new GstMeta : GstDeinterlaceMeta
typedef const gchar *GstDeinterlaceMethod;
struct _GstDeinterlaceMeta {
  GstMeta meta;
  GstDeinterlaceMethod method;
};
D9 : Context sharing
Re-use parts of -bad's videocontext ?
D10 : Non-MT-safe APIs
If the wrapped API/system does not offer an API which is MT-safe
and/or usable from more than one thread (like OpenGL), we need:
* A system by which a global context can be provided to all elements
wanting to use that system,
* A system by which elements can serialize processing to a 3rd party
thread.
[0]: Defining "noticeable overhead" is always tricky, but it essentially
means that the overhead introduced by GStreamer core and the element
code should not exceed the overhead introduced for non-hw-accelerated
elements.