Adding cudaipc{src,sink} element for CUDA IPC support.
Implementation note:
* For the communication between end points, Win32 named-pipe
and unix domain socket will be used on Windows and Linux respectively.
* cudaipcsink behaves as a server, and all GPU resources will be owned by
the server process and exported for other processes, then cudaipcsrc
(client) will import each exported handle.
* User can select IPC mode via "ipc-mode" property of cudaipcsink.
There are two IPC mode, one is "legacy" which uses legacy CUDA IPC
method and the other is "mmap" which uses CUDA virtual memory API
with OS's resource handle sharing method such as DuplicateHandle()
on Windows. The "mmap" mode might be better than "legacy" in terms
of stability since it relies on OS's resource management but
it would consume more GPU memory than "legacy" mode.
Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4510>
Call input resource map functions (i.e., nvEncRegisterResource,
nvEncUnregisterResource, nvEncMapInputResource, and
nvEncUnmapInputResource) only once and reuse the mapped resources,
instead of per input frame map/unmap
Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/3884>
Wrap mapped decoder output surface using GstCudaMemory and
output without any copy operation. Also, for application to be able to
control the number of zero-copyable output surfaces,
"num-output-surfaces" property is added.
Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/3884>
Rewriting GstCudaConverter object, since the old implementation was not
well organized and it's hard to add new features.
Moreover, the conversion operations were not very optimized.
Major change of this implementation:
* Remove redundant intermediate conversion operations such as
any RGB -> ARGB(64) conversion or any YUV -> Y444 (or 16bits Y444).
That's not required most of cases. The only required case is
converting 24bits (such as RGB/BGR) packed format to 32bits format
because CUDA texture object does not support sampling 24bits format
* Use normalized sample fetching (i.e., [0, 1] range float value)
and also normalized coordinates system for CUDA texture.
It's consistent with the other graphics APIs such as Direct3D
and OpenGL, that makes sampling operations much easier.
* Support a kind of viewport and adopt math for colorspace conversion
from GstD3D11 implementation
Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/3389>
GstCudaConverter object can do colorspace conversion and scale at once.
Adding new element "cudaconvertscale" to do that, this can
save unnecessary GPU operation if colorspace conversion and
rescale is required for given input stream format.
Most of codes are taken from d3d11convert element
Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/3389>
Adding new encoder elements nvd3d11{h264,h265}enc for Direct3D11
input support and re-written nvcuda{h264,h265}enc elements.
Newly writeen elements have some differences compared with old
nv{h264,h265}enc including non-backward compatible changes.
* RGBA is not a supported input format any more:
New elements will support only YUV formats to avoid implicit conversion
done by hardware. Ideally it should be done by upstream element
in order to have more control on it. Moreover, RGBA support can cause
redundant RGBA -> YUV conversion if multiple encoders are
used for the same RGBA input
* Subsampled planar format support is dropped:
I420 and YV12 format are not supported formats for Direct3D11.
Although it's supported in CUDA mode, it's not a hardware friendly
memory layout and it will waste GPU memory since UV planes
will have large padding due to the memory layout requirement of NVENC.
* GL support is dropped: Similar to the RGBA case,
GL support in encoder would be suboptimal if GL input is
used by multiple encoders, because each encoder will copy GL memory
into CUDA memory.
Upstream cudaupload element can be used for GL <-> CUDA
interop instead.
* No more pre-allocation of encoder input surfaces. New implementation
will use input CUDA memory without copy (zero-copy) or
will copy into a NVENC's input buffer struct in case of
system memory input.
Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1997>