Introduce GST_USE_NV_STATELESS_CODEC environment to allow user to select
primary nvidia decoder implementation. In case the environment
GST_USE_NV_STATELESS_CODEC=h264 was set, old nvidia h264 decoder element
will not be registered. Instead, both nvh264dec and nvh264sldec
factory name will create gstcodecs based new nvidia h264 decoder element.
Part-of: <https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/-/merge_requests/1198>
Introduce GstH264Decoder based Nvidia H.264 decoder element.
Similar the element factory name of to v4l2 stateless codec,
this element can be configured with factory name "gstnvh264sldec".
Note that "sl" in the name stands for "stateless"
For now, existing nvh264dec covers more profile and formats
(e.g., interlaced stream) than this implementation.
However, this implementation allows us to control lower level
parameters such as decoded picture buffer management and therefore
we can get a chance to improve performance in terms of latency.
Part-of: <https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/-/merge_requests/1198>
Too many decode surface would waste GPU memory. Also it seems to be
introducing additional latency depending on stream. Since nvcodec
sdk version 9.0, CUVID parser API has been providing the minimum
required number of surface. By using it, we can save GPU memory
and reduce possible latency.
The class data with the caps in it will be leaked if the element is
registered but never instantiated. There is no way around this. Mark
the caps as such so that the leaks tracer does not warn about it.
This is the same as pad template caps getting leaked, which are also
marked as may-be-leaked. These objects are initialized exactly once,
and are 'global' data.
We've been using NvEncodeAPICreateInstance method to find the supported API
version, but that seems to be insufficient since there is a case
where plugin failed in opening encoding session even if NvEncodeAPICreateInstance
succeeded. Asking driver about the version would be the most certain way.
Setting the CUVID_PKT_DISCONTINUITY implies clearing any past information
about the stream in the decoder. The GStreamer discont flag is used for
discontinuity caused by a seek, for first buffer and if a buffer was
dropped. In the first two cases, the parsers and demuxers should ensure we
start from a synchronization point, so it's unlikely that delta will be
matched against the wrong state.
For packet lost, the discontinuity flag will prevent the decoder from doing
any concealment, with a result that ca be much worst visually, or freeze the
playback until an IDR is met. It's better to let the decoder handle that for
us.
Removing this flag, also workaround a but in NVidia parser that makes it
ignore our ENDOFFRAME flag and increase the latency by one frame.
This sets the CUVID_PKT_ENDOFPICTURE flag in order to inform the decoder that
we have a complete picture. This should remove one frame latency otherwise
introduce by NVidia parser.
We weren't using the correct calling convention when calling CUDA and
CUVID APIs. `CUDAAPI` is `__stdcall` on Windows. This was working fine
on x64 because `__stdcall` is ignored and there's no special calling
convention. However, on x86, we need to use `__stdcall`.
Hard-coded 16x16 resolution is likely to differ from the device's support
in most cases. If we can use NV_ENC_CAPS_WIDTH_MIN and NV_ENC_CAPS_HEIGHT_MIN,
update pad template with returned value.
GstNvBaseEnc::n_bufs was set from the previous encoding session
but it wasn't cleared after stop. That might result to invalid memory
access at the next start (no encoded data) and then stop sequence.
Instead of defining a variable for array length, use GArray::len
directly to avoid such confusion.
If the last flow was not GST_FLOW_OK, the encoding thread is not running
and there is nothing to pop from GAsyncQueue (this causes deadlock).
To prevent deadlock, just return the handle_frame without further encoding
process if the last flow was not GST_FLOW_OK. Note that the last flow
will be cleared per FLUSH_STOP and STREAM_START event.
The hard-coded upper bound 32 (or 48 depending on resolution) might waste
GPU memory and high resolution encoding causes OUT-OF-MEMORY allocation error
quite easily. This commit calculates the number of required pre-allocated
device memory based on encoding options and it can reduce the amount of device memory
used by nvenc.
NVDEC driver always uses input timestamp without adjustment
even if bframe encoding was enabled.
So DTS can be larger than PTS when bframe was enabled.
To ensure PTS >= DTS, we should adjust the timestamp manually
based on the PTS difference between the first
encoded frame and the second one. That's also the maximum PTS/DTS
difference.
To support rc-lookahead and bframe encoding, nvenc needs one more
staging queue, because NvEncEncodePicture can return NV_ENC_ERR_NEED_MORE_INPUT
but which was not considered so far.
As documented by NVENC programming guide, pending buffers should wait
other inputs until NvEncEncodePicture returns success.
New encoding flow is
- Submit raw picture buffer to encoder with NvEncEncodePicture
- The submitted input/output buffer pair will be queued to pending_queue
- If NvEncEncodePicture returned success, then move all pair in pending_queue
to final stage
- Otherwise, wait more input raw pictures.
Another change is dropping NV_ENC_LOCK_INPUT_BUFFER usage.
So now nvenc always uses CUDA memory input buffer. As a result,
both opengl and system memory handling are unified.
* The number of iteration is always one so the iteration is useless
and that makes code complicated.
* Also defining named structure can code mroe readable.
* g_free is null safe
New rate-control modes are introduced (if device can support)
* cbr-ld-hr: CBR low-delay high quality
* cbr-hq: CBR high quality
* vbr-hq: VBR high quality
Also, various configurable rate-control related properties are added.
Introducing new dynamic class between GstNvBaseEncClass and
each subclass to be able to access device specific properties and
capabilities from each subclass implementation side.
Do not restrict allowed maximum resolution depending on the
initial resolution. If new resolution is larger than previous one,
just re-init encode session.
Due to uncleared last flow, decoding after seek was never possible
(last_ret == GST_FLOW_FLUSHING).
nvdec dose not need to keep track of the previous flow return,
and actually the interest is data/even flow of the current handle_frame().
Implementing ::negotiate() method to support runtime output format
change. If downstream was reconfigured, baseclass will invoke
::negotiate() method, and nvdec should update output memory
type depending on downstream caps.
Input stream might be silently changed without ::set_format() call.
Since nvdec has internal parser, nvdec element can figure out the format change
by itself.
Register openGL resource only once per memory. Also if upstream
provides the registered information, reuse the information
instead of doing it again. This can improve performance dramatically
depending on system since the resource registration might cause
high overhead.