mirror of
https://gitlab.freedesktop.org/gstreamer/gstreamer.git
synced 2024-12-27 10:40:34 +00:00
doc: Add analytics support design
Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/6139>
parent
fd971ed9d8
commit
a04ed4b91e

# Machine Learning Based Analytics

Analytics refers to the process of extracting information from the content of
one or more media. The analysis can be spatial only (for example, image
analysis), temporal only (for example, sound detection), spatio-temporal
(tracking or action recognition), or multi-modal (image and sound combined to
detect an environment or a behaviour). There are also scenarios where the
results of an analysis are used as the input of another analysis, with or
without an additional media. This design aims to support both ML-based
analytics and CV analytics, and to offer a way to bridge the two techniques.

## Vision

With this design we aim to allow GStreamer application developers to develop
analytics pipelines easily while taking full advantage of the acceleration
available on the platform where they deploy. The effort of moving an analytics
pipeline to a different platform should be minimal.

## Refinement Using Analytics Pipeline

Similarly to content-agnostic media processing (e.g. scaling, color-space
conversion, serialization, ...), this design promotes re-usability and
simplicity by allowing the composition of complex analytics pipelines from
simple dedicated analytics elements that complement each other.

### Example

Simple hypothetical example of an analytic pipeline.

```
+---------+ +------------+ +---------------+ +----------------+
| v4l2src | | video      | | onnxinference | | tensor-decoder |
|         | | convert    | |               | |                |
|       src-sink scale src-sink1        src1-sink           src---
|         | | (pre-proc) | | (analysis)    | | (post-proc)    |   /
+---------+ +------------+ +---------------+ +----------------+  /
                                                                /
 ---------------------------------------------------------------
| +-------------+ +------+
| | Analytic-   | | sink |
| | overlay     | |      |
 -sink        src-sink   |
  | (analysis   | |      |
  | -results    | +------+
  | -consumer)  |
  +-------------+
```

## Supporting Neural Network Inference

There are multiple frameworks supporting neural network inference. These can be
described more generally as computing graphs, as they are generally not limited
to NN inference applications. Existing NN inference or computing graph
frameworks, like ONNX-Runtime, are encapsulated into a GstElement/Filter. The
inference element loads a model, describing the computing graph, specified by a
property. The model expects inputs in a specific format and produces outputs in
a specific format. Depending on the model format, input/output formats can be
extracted from the model, as with ONNX, but this is not always the case.

### Inference Element

Inference elements are an encapsulation of an NN inference framework. Therefore
they are specific to a framework, like ONNX-Runtime or TensorFlow-Lite.
Other inference elements can be added.

### Inference Input(s)

The input format is defined by the model. Using the model input format, the
inference element can constrain its sinkpad(s) capabilities. Note that because
tensors are very generic, the term also covers images/frames, and the term
input tensor is also used to describe inference input.
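
As a rough illustration of how a model input description could constrain the
sinkpad caps, the following Python sketch turns an assumed NHWC `uint8` input
shape into a raw-video caps string. The function name, the NHWC layout and the
restriction to one image per buffer are assumptions made for this example; they
are not part of the design or of any GStreamer API.

```python
# Hypothetical helper: the name and the supported cases are illustrative only.
def caps_from_model_input(dims, datatype):
    """Derive a raw-video caps string from an NHWC model input description."""
    if len(dims) != 4:
        raise ValueError("expected 4 dimensions (N, H, W, C)")
    batch, height, width, channels = dims
    if batch != 1 or datatype != "uint8":
        raise ValueError("this sketch only handles one uint8 image per buffer")
    # Map the channel count to a packed raw-video format name.
    formats = {1: "GRAY8", 3: "RGB", 4: "RGBA"}
    if channels not in formats:
        raise ValueError(f"unsupported channel count: {channels}")
    return f"video/x-raw,format={formats[channels]},width={width},height={height}"


print(caps_from_model_input((1, 416, 416, 3), "uint8"))
# prints: video/x-raw,format=RGB,width=416,height=416
```

An inference element would apply an equivalent constraint to its sinkpad once
the model is loaded, so that upstream conversion elements can negotiate the
format the model expects.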

### Inference Output(s)

The outputs of the inference are tensors, and their format is also dictated by
the model. Analysis results are generally encoded in the output tensor in a way
that is specific to the model. Even models that target the same type of
analysis encode results in different ways.

### Model Formats Not Describing Input/Output Tensor Formats

With some models, the input/output tensor formats are not described. In this
context, it's the responsibility of the analytics pipeline to push input
tensors with the correct format into the inference process. The inference
element designer is then left with two choices: supporting a model manifest
where inputs/outputs are described, or leaving the constraining/fixing of the
inputs/outputs to the analytics pipeline designer, who can use caps filters to
constrain the inputs/outputs of the model.
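
The manifest option can be pictured with a small Python sketch. The design does
not define a manifest schema, so the dictionary layout, the field names and the
`check_input` helper below are all hypothetical; the point is only that a
side-car description lets the inference element validate what the pipeline
pushes into it.

```python
# Hypothetical manifest layout: the design mentions a manifest but no schema.
MANIFEST = {
    "inputs": [{"name": "image", "datatype": "uint8", "dims": [1, 416, 416, 3]}],
    "outputs": [{"name": "boxes", "datatype": "float32", "dims": [100, 5]}],
}


def check_input(manifest, name, datatype, dims):
    """Return True when (datatype, dims) matches the manifest entry for name."""
    for entry in manifest["inputs"]:
        if entry["name"] == name:
            return entry["datatype"] == datatype and entry["dims"] == list(dims)
    return False  # unknown input name


print(check_input(MANIFEST, "image", "uint8", (1, 416, 416, 3)))    # True
print(check_input(MANIFEST, "image", "float32", (1, 416, 416, 3)))  # False
```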

### Tensor Decoders

In order to preserve the generality of the inference element, tensor decoding is
omitted from the inference element and left to specialized elements whose
specific task is to decode tensors from a specific model. Additionally, tensor
decoding does not depend on a specific NN framework or inference element, which
allows reusing a tensor decoder when the same model is used with a different
inference element. For example, a YOLOv3 tensor decoder can be used to decode
tensors from inference using a YOLOv3 model with an element encapsulating ONNX
or TFLite. Note that a tensor decoder can handle multiple tensors that have
similar encoding.

### Tensor

N-dimensional vector.

#### Tensor Type Identifier

This is an identifier, string or quark, that uniquely identifies a tensor type.
The tensor type describes the specific format used to encode analysis results
in memory. This identifier is used by tensor decoders to know if they can
handle the decoding of a tensor. For this reason, from an implementation
perspective, the tensor decoder is the ideal location to store the tensor type
identifier, as the code is already model specific. Since the tensor decoder is
by design specific to a model, no generality is lost by storing the tensor type
identifier in it.

#### Tensor Datatype

This is the primitive type used to store tensor data, such as `int8`, `uint8`,
`float16`, `float32`, ...

#### Tensor Dimension Cardinality

Number of dimensions in the tensor.

#### Tensor Dimension

Tensor shape.

- [a], 1-dimensional vector
- [a x b], 2-dimensional vector
- [a x b x c], 3-dimensional vector
- [a x b x ... x n], N-dimensional vector
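
The four properties above can be grouped into a single description. The Python
sketch below is only a model of the concepts: the class name, its fields and
the identifier value are assumptions for illustration and do not correspond to
an actual GStreamer structure.

```python
from dataclasses import dataclass
from math import prod

# Bytes per element for the datatypes listed above.
DATATYPE_SIZE = {"int8": 1, "uint8": 1, "float16": 2, "float32": 4}


@dataclass(frozen=True)
class TensorDescription:
    type_id: str   # tensor type identifier (the value used below is made up)
    datatype: str  # primitive type of each element
    dims: tuple    # tensor shape, e.g. (100, 5)

    @property
    def cardinality(self):
        """Number of dimensions in the tensor."""
        return len(self.dims)

    @property
    def num_elements(self):
        return prod(self.dims)

    @property
    def size_in_bytes(self):
        return self.num_elements * DATATYPE_SIZE[self.datatype]


t = TensorDescription("example-detection-v1", "float32", (100, 5))
print(t.cardinality, t.num_elements, t.size_in_bytes)  # prints: 2 500 2000
```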

### Tensor Decoders Need to Recognize Tensor(s) They Can Handle

As mentioned before, tensor decoders need to be able to recognize the tensor(s)
they can handle. It's important to keep in mind that multiple tensors can be
attached to a buffer when tensors are transported as a meta. It would be easy
to believe that a tensor's (cardinality + dimension + data type) is sufficient
to recognize a specific tensor format, but we need to remember that analysis
results are encoded into the tensor and that retrieving analysis results
requires a decoding process specific to the model. In other words, a tensor
A:{cardinality:2, dimension:100 x 5, datatype:int8} and a tensor
B:{cardinality:2, dimension:100 x 5, datatype:int8} can have completely
different meanings.

A could be object detection where each candidate is encoded with top-left
coordinates, width, height and object location confidence level:

```
0 : [ x1, y1, w, h, location confidence]
1 : [ x1, y1, w, h, location confidence]
...
99: [ x1, y1, w, h, location confidence]
```

B could be object detection where each candidate is encoded with top-left
coordinates, bottom-right coordinates and object class confidence level:

```
0 : [ x1, y1, x2, y2, class confidence]
1 : [ x1, y1, x2, y2, class confidence]
...
99: [ x1, y1, x2, y2, class confidence]
```

We can see that even if A and B have the same (cardinality, dimension, data
type), a tensor decoder expecting A and decoding B would produce wrong results.

In general, for high cardinality tensors, the risk of having two tensors with
the same (cardinality + dimension + data type) is low, but if we think of low
cardinality tensors typical of classification (1 x C), we can see that the risk
is much higher. For this reason, we believe it's not sufficient for a tensor
decoder to rely only on (cardinality + dimension + data type) to identify the
tensors it can handle.
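
To make the ambiguity concrete, the Python sketch below decodes the same row
under the two layouts described above and then selects the decoder by tensor
type identifier instead of by shape. The row values, the identifier strings and
the function names are invented for the example.

```python
# One row from a 100 x 5 tensor; the values are made up for illustration.
row = [10, 20, 50, 80, 90]


def decode_as_a(r):
    """Layout A: top-left corner, width, height, location confidence."""
    x1, y1, w, h, conf = r
    return {"x1": x1, "y1": y1, "x2": x1 + w, "y2": y1 + h, "confidence": conf}


def decode_as_b(r):
    """Layout B: top-left corner, bottom-right corner, class confidence."""
    x1, y1, x2, y2, conf = r
    return {"x1": x1, "y1": y1, "x2": x2, "y2": y2, "confidence": conf}


# Hypothetical tensor type identifiers mapped to the decoder that handles them.
DECODERS = {"example-layout-a": decode_as_a, "example-layout-b": decode_as_b}


def decode(type_id, r):
    """Pick the decoder from the identifier, never from the shape alone."""
    decoder = DECODERS.get(type_id)
    if decoder is None:
        raise LookupError(f"no decoder for tensor type {type_id!r}")
    return decoder(r)


print(decode("example-layout-a", row))  # box spans x 10..60, y 20..100
print(decode("example-layout-b", row))  # box spans x 10..50, y 20..80
```

Both calls succeed on identical bytes, which is exactly why shape and datatype
cannot tell a decoder whether its interpretation is the right one.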

#### A Tensor Decoder's Second Job: Non-Maximum Suppression (NMS)

The main functionality of tensor decoders is to extract analytics results from
tensors, but in addition to decoding tensors, a second phase of post-processing
is generally handled by the tensor decoder. This post-processing phase is called
non-maximum suppression (NMS). The simplest example of NMS is with
classification. For every input, the classification model produces a
probability for each potential class. In general, we're mostly interested in
the most probable class, or the few most probable classes, and there's little
value in transporting the probability of every class. In addition to keeping
only the most probable class (or classes), we often want the probability to be
above a certain threshold, otherwise we're not interested in the result.
Because a significant portion of the analytics results from the inference
process don't have much value, we want to filter them out as early as possible.
Since analytics results are only available after tensor decoding, the tensor
decoder is tasked with this type of filtering (NMS). The same concept exists
for object detection, where NMS generally involves calculating the
intersection-over-union (IoU) in combination with location and class
probability. Because ML-based analytics are probabilistic by nature, they
generally need a form of NMS post-processing.
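
Both forms of filtering can be sketched in a few lines of Python. This is a
generic greedy NMS, not the implementation of any particular tensor decoder;
the function names, thresholds and sample values are assumptions chosen for the
example.

```python
def top_classes(probabilities, k=1, threshold=0.5):
    """Keep the k most probable classes whose probability reaches threshold."""
    ranked = sorted(enumerate(probabilities), key=lambda item: item[1], reverse=True)
    return [(index, p) for index, p in ranked[:k] if p >= threshold]


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5, score_threshold=0.3):
    """Greedy NMS over (box, score) pairs: best score first, drop heavy overlaps."""
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if all(iou(box, other) <= iou_threshold for other, _ in kept):
            kept.append((box, score))
    return kept


print(top_classes([0.05, 0.80, 0.15], k=2, threshold=0.1))
# prints: [(1, 0.8), (2, 0.15)]

detections = [
    ((10, 10, 50, 50), 0.9),
    ((12, 12, 52, 52), 0.8),      # overlaps the first box heavily
    ((100, 100, 140, 140), 0.7),
    ((0, 0, 5, 5), 0.1),          # below the score threshold
]
print([score for _, score in nms(detections)])  # prints: [0.9, 0.7]
```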

#### Handling Multiple Tensors Simultaneously In A Tensor Decoder

Sometimes it is necessary, or more efficient, to have a tensor decoder handle
multiple tensors simultaneously. In some cases, the tensors are complementary
and a tensor decoder needs all of them to decode the analytics results. In
other cases, it's just more efficient to do it simultaneously because of the
tensor decoder's second job, NMS. Let's consider YOLOv3, where 3 output tensors
are produced for each input. One tensor represents the detection of small
objects, a second tensor medium-size objects and a third tensor large-size
objects. In this context, it's beneficial to have the tensor decoder decode the
3 tensors simultaneously and perform the NMS on all the results, otherwise
analytics results with low value would remain in the system for longer. This
has implications for the negotiation of tensor decoders, which will be expanded
on in the section dedicated to tensor decoder negotiation.
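
A joint decoder of this kind can be sketched as follows. The three identifier
strings are hypothetical, the candidates are shown already decoded as
`(label, score)` pairs, and the single filtering pass only thresholds and sorts;
a real decoder would run IoU-based NMS at that point.

```python
# Hypothetical identifiers for the three YOLOv3 output tensors.
REQUIRED = ("example-yolov3-small", "example-yolov3-medium", "example-yolov3-large")


def can_decode(attached):
    """A joint decoder only runs when every tensor it needs is attached."""
    return all(type_id in attached for type_id in REQUIRED)


def decode_all(attached, score_threshold=0.5):
    """Pool candidates from the three tensors, then filter them in one pass."""
    if not can_decode(attached):
        raise LookupError("missing one of the required tensors")
    pooled = [c for type_id in REQUIRED for c in attached[type_id]]
    # Stand-in for NMS: keep confident candidates, best score first.
    kept = [c for c in pooled if c[1] >= score_threshold]
    return sorted(kept, key=lambda c: c[1], reverse=True)


# Tensors attached to one buffer, keyed by tensor type identifier.
buffer_tensors = {
    "example-yolov3-small": [("cup", 0.91), ("cup", 0.20)],
    "example-yolov3-medium": [("dog", 0.75)],
    "example-yolov3-large": [("car", 0.40)],
}
print(decode_all(buffer_tensors))  # prints: [('cup', 0.91), ('dog', 0.75)]
```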

### Why Interpret (Decode) Tensors

As described above, tensors contain information and are used to store analytics
results. The analytics results are encoded in a model-specific way into the
tensor, and unless their consumers (the processes making use of the analytics
results) are also model specific, the tensors need to be decoded. Deciding
whether the analytics pipeline will have elements producing and consuming
tensors directly in their encoded form, or whether a tensor-decoding process
will be done between tensor production and consumption, is a design decision
that involves a compromise between re-usability and performance. As an example,
an object detection overlay element would need to be model specific to directly
consume tensors. Therefore, it would need to be re-written for any
object-detection model using a different encoding scheme, but if the only goal
of the analytics pipeline is to do this overlay, it would probably be the most
efficient implementation. Another aspect in favour of interpreting tensors is
that we can have multiple consumers of the analytics results, and if the tensor
decoding is left to the consumers themselves, it implies decoding the same
tensor multiple times. However, we can think of two models specifically
designed to work together, where the output of one model becomes the input of
the downstream model. In this context, the downstream model is not re-usable
without the upstream model, but together they bypass the need for tensor
decoding and are very efficient. Another variation is that multiple models are
merged into one model, removing the need for multi-level inference, but again,
this is a design decision involving a compromise between re-usability,
performance and effort. We aim to provide support for all these use cases, and
to allow the analytics pipeline designer to make the best design decisions
based on their specific context.

#### Analytics Meta

The Analytics Meta (GstAnalyticsRelationMeta) is the foundation of the
re-usability of analytics results. Its goal is to store analytics results
(GstAnalyticsMtd) in an efficient way and to allow relations to be defined
between them. GstAnalyticsMtd is very primitive and is meant to be expanded.
GstAnalyticsMtdClassification (storage for classification results),
GstAnalyticsMtdObjectDetection (storage for object detection results) and
GstAnalyticsMtdTracking (storage for object tracking) are specializations and
can be used as references to create other storage, based on GstAnalyticsMtd,
for other types of analytics results.

There are two major use cases for the ability to define relations between
analytics results. The first one is to define a relation between analytics
results that were generated at different stages. A good example of this would
be a first analysis that detects cars in an image, followed by a second-level
analysis where only the section of the image presenting a car is pushed to a
second analysis to extract the brand/model of the car in that section. This
analytics result is then appended to the original image with a relation to the
object-detection result that localized this car in the image.

The other use case for relations is to create compositions by re-using existing
GstAnalyticsMtd specializations. The relations between different analytics
results are completely decoupled from the analytics results themselves.

All relation definitions are stored in GstAnalyticsRelationMeta, which is a
container of GstAnalyticsMtd and also contains an adjacency matrix storing the
relations. One of the benefits is the ability of a consumer of the analytics
meta to explore the graph and follow relations between analytics results
without having to understand every type of result in the relation path. Another
important aspect is that analytics meta are not specific to machine learning
techniques and can also be used to store analysis results from computer vision,
heuristics or other techniques. It can be used as a bridge between different
techniques.
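
The container-plus-adjacency-matrix idea can be modelled in plain Python. This
toy class is not the GstAnalyticsRelationMeta API: the class, its methods, the
relation labels and the sample payloads are all invented to show how a consumer
can follow relations without interpreting the results along the path.

```python
class RelationMeta:
    """Toy container: a list of results plus an adjacency matrix of relations."""

    def __init__(self):
        self.results = []    # each entry is (kind, payload)
        self.adjacency = []  # adjacency[i][j] is a relation label or None

    def add(self, kind, payload):
        """Store a result, grow the matrix by one row and column, return its id."""
        for row in self.adjacency:
            row.append(None)
        self.results.append((kind, payload))
        self.adjacency.append([None] * len(self.results))
        return len(self.results) - 1

    def relate(self, source, target, label):
        self.adjacency[source][target] = label

    def related(self, source):
        """Ids directly related to source, whatever their kind."""
        return [j for j, label in enumerate(self.adjacency[source]) if label is not None]

    def reachable(self, source):
        """Follow relations transitively without inspecting any payload."""
        seen, stack = [], [source]
        while stack:
            node = stack.pop()
            for nxt in self.related(node):
                if nxt != source and nxt not in seen:
                    seen.append(nxt)
                    stack.append(nxt)
        return sorted(seen)


meta = RelationMeta()
car = meta.add("object-detection", {"label": "car", "box": (40, 60, 200, 120)})
brand = meta.add("classification", {"label": "example-brand", "confidence": 0.87})
track = meta.add("tracking", {"track_id": 7})
meta.relate(car, brand, "contains")
meta.relate(brand, track, "relates-to")

print(meta.related(car))    # prints: [1]
print(meta.reachable(car))  # prints: [1, 2]
```

Because relations live in the matrix rather than in the results, a new kind of
result can be added without touching existing ones, which mirrors the
decoupling described above.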

### Tensor Transport Mode

Two transport modes are envisioned: as Meta or as Media. Both modes have pros
and cons, which justifies supporting both. Currently, tensors are only
transported as meta.

#### Tensor Transport As Meta

In this mode, the tensor is attached to the buffer (the media) on which the
analysis was performed. The advantage of this mode is that the original media is
kept in direct association with the analytics results. Further refinement
analysis or consumption (like overlay) of the analytics results is easier when
the media on which the analysis was performed is available and easily
identifiable. Another advantage is the ability to keep a relation description
between tensors in a refinement context. On the other hand, this mode of
transporting analytics results makes negotiation, of tensor decoders in
particular, difficult.

### Inference Sinkpad(s) Capabilities

Sinkpad capabilities, before being constrained based on the model, can be any
media type.

### Inference Srcpad(s) Capabilities

Srcpad capabilities will be identical to sinkpad capabilities.

# Reference

- [Onnx-Refactor-MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4916)
- [Analytics-Meta MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4962)