# Machine Learning Based Analytics
Analytics refers to the process of extracting information from the content of
one or more media. The analysis can be spatial only (for example, image analysis),
temporal only (like sound detection), spatio-temporal (tracking or action
recognition), or multi-modal (image and sound combined to detect an environment
or a behaviour). There are also scenarios where the results of an analysis are
used as input to a further analysis, with or without an additional media. This
design aims to support ML-based analytics and CV analytics, and to offer a way
to bridge both techniques.
## Vision
With this design we aim to allow GStreamer application developers to develop
analytics pipelines easily while taking full advantage of the acceleration
available on the platform where they deploy. The effort of moving an analytics
pipeline to a different platform should be minimal.
## Refinement Using Analytics Pipeline
Similarly to content-agnostic media processing (e.g. scaling, color-space
conversion, serialization, ...), this design promotes re-usability and simplicity
by allowing the composition of complex analytics pipelines from simple dedicated
analytics elements that complement each other.
### Example
A simple hypothetical example of an analytics pipeline:
```
+---------+    +----------+    +---------------+    +----------------+
| v4l2src |    | video    |    | onnxinference |    | tensor-decoder |
|         |    | convert  |    |               |    |                |
|        src-sink  scale src-sink1         src1-sink                src---
|         |    |(pre-proc)|    | (analysis)    |    | (post-proc)    |  /
+---------+    +----------+    +---------------+    +----------------+ /
                                                                      /
----------------------------------------------------------------------
|   +-------------+    +------+
|   | Analytic-   |    | sink |
|   | overlay     |    |      |
 -sink           src-sink     |
    | (analysis   |    |      |
    |  -results   |    +------+
    |  -consumer) |
    +-------------+
```
## Supporting Neural Network Inference
There are multiple frameworks supporting neural network inference. These can be
described more generally as computing graphs, as they are generally not limited
to NN inference applications. Existing NN inference or computing graph frameworks,
like ONNX-Runtime, are encapsulated into a GstElement/Filter. The inference element
loads a model, describing the computing graph, specified by a property. The model
expects inputs in a specific format and produces outputs in a specific
format. Depending on the model format, input/output formats can be extracted
from the model, like with ONNX, but this is not always the case.
### Inference Element
Inference elements are an encapsulation of an NN Inference framework. Therefore
they are specific to a framework, like ONNX-Runtime or TensorFlow-Lite.
Other inference elements can be added.
### Inference Input(s)
The input format is defined by the model. Using the model input format, the
inference element can constrain the capabilities of its sinkpad(s). Note that,
because tensors are very generic, the term also covers images/frames, and the
term input tensor is also used to describe an inference input.
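
As an illustration of how a model input format could constrain sink pad capabilities, the following is a minimal sketch and not GStreamer code. `ModelInput`, `video_caps_for` and the channel-count-to-format mapping are hypothetical; only the `video/x-raw` media type and the `GRAY8`, `RGB` and `RGBA` format names are taken from GStreamer.

```python
from dataclasses import dataclass


# Hypothetical description of a model input, as an inference element might
# read it from the model file. None of these names are GStreamer API.
@dataclass(frozen=True)
class ModelInput:
    dims: tuple      # e.g. (1, 480, 640, 3)
    layout: str      # "NHWC" or "NCHW", one letter per dimension
    datatype: str    # e.g. "uint8"


def video_caps_for(model_input: ModelInput) -> str:
    """Build a caps-like string restricting a sink pad to frames the model accepts."""
    if len(model_input.dims) != len(model_input.layout):
        raise ValueError("layout does not match the number of dimensions")
    if model_input.datatype != "uint8":
        raise ValueError("this sketch only maps 8-bit inputs to raw video formats")
    axes = dict(zip(model_input.layout, model_input.dims))
    # Assumed mapping from channel count to a raw video format name.
    formats = {1: "GRAY8", 3: "RGB", 4: "RGBA"}
    if axes["C"] not in formats:
        raise ValueError(f"unsupported channel count: {axes['C']}")
    return (f"video/x-raw,format={formats[axes['C']]},"
            f"width={axes['W']},height={axes['H']}")


caps = video_caps_for(ModelInput((1, 480, 640, 3), "NHWC", "uint8"))
print(caps)  # video/x-raw,format=RGB,width=640,height=480
```

A real element would express this as caps on its pad template or in its caps query handler; the string form is used here only to keep the sketch self-contained.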
### Inference Output(s)
Output(s) of the inference are tensors, and their format is also dictated by the
model. Analysis results are generally encoded in the output tensor in a way that
is specific to the model. Even models that target the same type of analysis
encode results in different ways.
### Models Format Not Describing Inputs/Outputs Tensor Format
With some models, the input/output tensor formats are not described. In
this context, it's the responsibility of the analytics pipeline to push input
tensors with the correct format into the inference process. The inference
element designer is then left with two choices: supporting a model manifest
where inputs/outputs are described, or leaving the constraining/fixing of the
inputs/outputs to the analytics pipeline designer, who can use caps filters to
constrain the inputs/outputs of the model.
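
To make the manifest option more concrete, here is a minimal sketch of what such a manifest and a check against it could look like. The JSON schema, the field names and the helper functions are all assumptions made for illustration; the design does not define a manifest format.

```python
import json

# Hypothetical manifest content; the schema is an assumption for illustration.
MANIFEST = """
{
  "inputs":  [{"name": "image", "datatype": "uint8", "dims": [1, 480, 640, 3]}],
  "outputs": [{"name": "detections", "datatype": "float32", "dims": [100, 5]}]
}
"""


def load_manifest(text):
    """Parse the manifest and index tensor descriptions by name."""
    doc = json.loads(text)
    return {
        side: {t["name"]: (t["datatype"], tuple(t["dims"])) for t in doc[side]}
        for side in ("inputs", "outputs")
    }


def check_input(manifest, name, datatype, dims):
    """Return True when a tensor about to be pushed matches the manifest."""
    return manifest["inputs"].get(name) == (datatype, tuple(dims))


manifest = load_manifest(MANIFEST)
print(check_input(manifest, "image", "uint8", (1, 480, 640, 3)))    # True
print(check_input(manifest, "image", "float32", (1, 480, 640, 3)))  # False
```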
### Tensor Decoders
In order to preserve the generality of the inference element, tensor decoding is
omitted from the inference element and left to specialized elements whose
specific task is decoding tensors from a specific model. Additionally,
tensor decoding does not depend on a specific NN framework or inference element,
which allows reusing a tensor decoder when the same model is used with a
different inference element. For example, a YOLOv3 tensor decoder can be used to
decode tensors from inference using a YOLOv3 model with an element encapsulating
ONNX or TFLite. Note that a tensor decoder can handle multiple tensors that have
similar encoding.
### Tensor
N-dimensional vector.
#### Tensor Type Identifier
This is an identifier, string or quark, that uniquely identifies a tensor type. The
tensor type describes the specific format used to encode analysis results in
memory. This identifier is used by tensor decoders to know if they can handle
the decoding of a tensor. For this reason, from an implementation perspective,
the tensor decoder is the ideal location to store the tensor type identifier, as
its code is already model specific. Since the tensor decoder is by design specific
to a model, no generality is lost by storing the tensor type identifier in it.
#### Tensor Datatype
This is the primitive type used to store tensor data, such as `int8`,
`uint8`, `float16`, `float32`, ...
#### Tensor Dimension Cardinality
Number of dimensions in the tensor.
#### Tensor Dimension
Tensor shape.
- [a], 1-dimensional vector
- [a x b], 2-dimensional vector
- [a x b x c], 3-dimensional vector
- [a x b x ... x n], N-dimensional vector
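
The four attributes above can be grouped into a single description. The following is a minimal sketch, assuming a hypothetical `TensorDescription` class and an example type identifier string; it is not the structure used by GStreamer.

```python
from dataclasses import dataclass
from math import prod

# Assumed element sizes in bytes for the datatypes named above.
ELEMENT_SIZE = {"int8": 1, "uint8": 1, "float16": 2, "float32": 4}


@dataclass(frozen=True)
class TensorDescription:
    type_id: str    # tensor type identifier (the value used below is made up)
    datatype: str   # tensor datatype
    dims: tuple     # tensor dimension (shape)

    @property
    def cardinality(self) -> int:
        """Number of dimensions in the tensor."""
        return len(self.dims)

    @property
    def byte_size(self) -> int:
        """Bytes needed to store every element of the tensor."""
        return prod(self.dims) * ELEMENT_SIZE[self.datatype]


desc = TensorDescription("example-detection-tensor", "float32", (100, 5))
print(desc.cardinality, desc.byte_size)  # 2 2000
```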
### Tensor Decoders Need to Recognize Tensor(s) They Can Handle
As mentioned before, tensor decoders need to be able to recognize the tensor(s)
they can handle. It's important to keep in mind that multiple tensors can be
attached to a buffer when tensors are transported as a meta. It could be easy to
believe that a tensor's (cardinality + dimension + data type) is sufficient to
recognize a specific tensor format, but we need to remember that analysis results
are encoded into the tensor and that retrieving analysis results requires a
decoding process specific to the model. In other words, a tensor A:{cardinality:2,
dimension: 100 x 5, datatype:int8} and a tensor B:{cardinality:2, dimension: 100 x 5,
datatype:int8} can have completely different meanings.

A could be object detection where each candidate is encoded with (top-left)
coordinates, width, height and object location confidence level:
```
0 : [ x1, y1, w, h, location confidence]
1 : [ x1, y1, w, h, location confidence]
...
99: [ x1, y1, w, h, location confidence]
```
B could be object detection where each candidate is encoded with (top-left)
coordinates, (bottom-right) coordinates and object class confidence level:
```
0 : [ x1, y1, x2, y2, class confidence]
1 : [ x1, y1, x2, y2, class confidence]
...
99: [ x1, y1, x2, y2, class confidence]
```
We can see that even if A and B have the same (cardinality, dimension, data type),
a tensor decoder expecting A and decoding B would produce wrong results.

In general, for high cardinality tensors, the risk of having two tensors with the
same (cardinality + dimension + data type) is low, but if we think of low
cardinality tensors typical of classification (1 x C), we can see that the risk is
much higher. For this reason, we believe it's not sufficient for a tensor decoder
to rely only on (cardinality + dimension + data type) to identify the tensors it
can handle.
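
The ambiguity between A and B can be reproduced in a few lines. This is a minimal sketch: the type identifier strings `layout-a` and `layout-b`, the dictionary-based tensor and the decoder functions are hypothetical. The same row yields two different boxes depending on the layout assumed, which is why the decoder is selected by type identifier.

```python
# One row with the same shape and datatype, interpreted under both layouts.
row = [10, 20, 30, 40, 90]


def decode_a(r):
    """Layout A: top-left corner, width, height, location confidence."""
    x1, y1, w, h, conf = r
    return {"x1": x1, "y1": y1, "x2": x1 + w, "y2": y1 + h, "confidence": conf}


def decode_b(r):
    """Layout B: top-left corner, bottom-right corner, class confidence."""
    x1, y1, x2, y2, conf = r
    return {"x1": x1, "y1": y1, "x2": x2, "y2": y2, "confidence": conf}


# Hypothetical tensor type identifiers mapped to their decoders.
DECODERS = {"layout-a": decode_a, "layout-b": decode_b}


def decode(tensor):
    """Pick the decoder by type identifier, not by shape or datatype."""
    decoder = DECODERS.get(tensor["type_id"])
    if decoder is None:
        return None  # not a tensor this set of decoders can handle
    return [decoder(r) for r in tensor["rows"]]


box_a = decode({"type_id": "layout-a", "rows": [row]})[0]
box_b = decode({"type_id": "layout-b", "rows": [row]})[0]
print(box_a["x2"], box_a["y2"])  # 40 60
print(box_b["x2"], box_b["y2"])  # 30 40
```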
#### A Tensor Decoder's Second Job: Non-Maximum Suppression (NMS)
The main function of tensor decoders is to extract analytics results from tensors,
but in general they also handle a second phase of post-processing, called
non-maximum suppression (NMS). The simplest example of NMS is with classification.
For every input, the classification model produces a probability for each
potential class. In general, we're mostly interested in the most probable class,
or the few most probable classes, and there's little value in transporting the
probabilities of all classes. In addition to keeping only the most probable class
(or classes), we often want the probability to be above a certain threshold,
otherwise we're not interested in the result. Because a significant portion of the
analytics results from the inference process don't have much value, we want to
filter them out as early as possible. Since analytics results are only available
after tensor decoding, the tensor decoder is tasked with this type of filtering
(NMS). The same concept exists for object detection, where NMS generally involves
calculating the intersection-over-union (IoU) in combination with location and
class probability. Because ML-based analytics are probabilistic by nature, they
generally need a form of NMS post-processing.
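
The classification case described above can be sketched as follows. The function name, the `top_k` and `threshold` parameters and their default values are illustrative assumptions, not values prescribed by the design.

```python
def filter_classes(probabilities, top_k=1, threshold=0.5):
    """Keep at most top_k classes whose probability reaches the threshold."""
    ranked = sorted(enumerate(probabilities), key=lambda p: p[1], reverse=True)
    return [(index, prob) for index, prob in ranked[:top_k] if prob >= threshold]


probs = [0.05, 0.70, 0.20, 0.05]
print(filter_classes(probs))                          # [(1, 0.7)]
print(filter_classes(probs, top_k=2, threshold=0.1))  # [(1, 0.7), (2, 0.2)]
print(filter_classes([0.3, 0.3, 0.4]))                # []
```

In the last call the most probable class is below the threshold, so nothing is kept and no result needs to travel further down the pipeline.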
#### Handling Multiple Tensors Simultaneously In A Tensor Decoder
Sometimes it is necessary, or more efficient, to have a tensor decoder handle
multiple tensors simultaneously. In some cases, the tensors are complementary and
the tensor decoder needs all of them to decode the analytics results. In other
cases, it's just more efficient to do it simultaneously because of the
tensor decoder's second job, NMS. Let's consider YOLOv3, where 3 output tensors
are produced for each input. One tensor represents the detection of small objects,
a second tensor medium-sized objects and a third tensor large objects. In this
context, it's beneficial to have the tensor decoder decode the 3 tensors
simultaneously and perform the NMS on all the results, otherwise analytics results
with low value would remain in the system for longer. This has implications for
the negotiation of tensor decoders, which will be expanded on in the section
dedicated to tensor decoder negotiation.
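
The benefit of running NMS once over all tensors can be shown with a small sketch. It uses a plain greedy IoU-based NMS, which is one common variant and not necessarily what a given decoder implements. The candidate boxes, scores and thresholds are made-up values standing in for already-decoded results from three output tensors.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(candidates, iou_threshold=0.5, score_threshold=0.25):
    """Greedy NMS over (box, score) pairs, highest score first."""
    ordered = sorted((c for c in candidates if c[1] >= score_threshold),
                     key=lambda c: c[1], reverse=True)
    kept = []
    for box, score in ordered:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, score))
    return kept


# Hypothetical decoded candidates from the small, medium and large tensors.
small = [((10, 10, 50, 50), 0.60)]
medium = [((12, 12, 52, 52), 0.90), ((200, 200, 260, 260), 0.20)]
large = [((100, 100, 180, 180), 0.80)]

together = nms(small + medium + large)
separately = nms(small) + nms(medium) + nms(large)
print(len(together), len(separately))  # 2 3
```

Processing the tensors separately keeps the duplicate from the small-object tensor, because the better overlapping candidate lives in another tensor.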
### Why Interpreting (decoding) Tensors
As we described above, tensors contain information and are used to store analytics
results. The analytics results are encoded in a model-specific way into the
tensor, and unless their consumers (processes making use of analytics results) are
also model specific, they need to be decoded. Deciding if the analytics pipeline
will have elements producing and consuming tensors directly in their encoded
form, or if a tensor-decoding process will be done between tensor production and
consumption, is a design decision that involves a compromise between re-usability
and performance. As an example, an object detection overlay element would need to
be model specific to directly consume tensors. Therefore, it would need to be
re-written for any object-detection model using a different encoding scheme, but
if the only goal of the analytics pipeline is to do this overlay, it would
probably be the most efficient implementation. Another aspect in favour of
interpreting tensors is that we can have multiple consumers of the analytics
results, and if the tensor decoding is left to the consumers themselves, it implies
decoding the same tensor multiple times. However, we can think of two models
specifically designed to work together, where the output of one model becomes the
input of the downstream model. In this context, the downstream model is not
re-usable without the upstream model, but together they bypass the need for
tensor decoding and are very efficient. Another variation is that multiple
models are merged into one model, removing the need for multi-level inference,
but again, this is a design decision involving a compromise between re-usability,
performance and effort. We aim to provide support for all these use cases,
and to allow the analytics pipeline designer to make the best design decisions
based on their specific context.
#### Analytics Meta
The Analytics Meta (GstAnalyticsRelationMeta) is the foundation of re-usability of
analytics results. Its goal is to store analytics results (GstAnalyticsMtd)
in an efficient way and to allow relations to be defined between them.
GstAnalyticsMtd is very primitive and is meant to be expanded.
GstAnalyticsMtdClassification (storage for classification results),
GstAnalyticsMtdObjectDetection (storage for object detection results) and
GstAnalyticsMtdTracking (storage for object tracking) are specializations and can
be used as references to create other storage, based on GstAnalyticsMtd, for other
types of analytics results.

There are two major use cases for the ability to define relations between
analytics results. The first one is to define a relation between analytics results
that were generated at different stages. A good example is a first analysis that
detects cars in an image, followed by a second-level analysis where only the
section of the image containing a car is pushed to a second model to extract the
brand/model of the car. This analytics result is then appended to the original
image with a relation defined with the object-detection result that localized
this car in the image.

The other use case for relations is to create compositions by re-using existing
GstAnalyticsMtd specializations. The relations between different analytics results
are completely decoupled from the analytics results themselves.

All relation definitions are stored in GstAnalyticsRelationMeta, which is a
container of GstAnalyticsMtd and also contains an adjacency matrix storing
relations. One of the benefits is the ability of a consumer of analytics meta to
explore the graph and follow relations between analytics results without having to
understand every type of result in the relation path. Another important aspect is
that analytics meta are not specific to machine learning techniques and can also
be used to store analysis results from computer vision, heuristics or other
techniques. They can be used as a bridge between different techniques.
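
The idea of a result container paired with an adjacency matrix can be sketched as below. This is a toy model for illustration only. `RelationGraph`, its methods and the relation labels are hypothetical and do not reproduce the GstAnalyticsRelationMeta API or its memory layout.

```python
from collections import deque


class RelationGraph:
    """Toy container: analytics results plus an adjacency matrix of relations."""

    def __init__(self):
        self.results = []    # analytics results of any type
        self.adjacency = []  # adjacency[i][j] is a relation label or None

    def add(self, result):
        """Store a result, grow the matrix by one row and column, return its id."""
        for row in self.adjacency:
            row.append(None)
        self.results.append(result)
        self.adjacency.append([None] * len(self.results))
        return len(self.results) - 1

    def relate(self, src, dst, label):
        self.adjacency[src][dst] = label

    def reachable(self, start):
        """Ids reachable from start by following relations of any kind."""
        seen, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for other, label in enumerate(self.adjacency[current]):
                if label is not None and other not in seen:
                    seen.add(other)
                    queue.append(other)
        seen.discard(start)
        return sorted(seen)


graph = RelationGraph()
car = graph.add({"kind": "object-detection", "label": "car"})
brand = graph.add({"kind": "classification", "label": "brand-x"})
track = graph.add({"kind": "tracking", "id": 7})
graph.relate(car, brand, "contains")
graph.relate(brand, car, "is-part-of")
graph.relate(car, track, "relates-to")
print(graph.reachable(brand))  # [0, 2]
```

The traversal only reads the matrix, so a consumer can go from the classification result to the tracking result without knowing how the object-detection result in between is structured.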
### Tensor Transport Mode
Two transport modes are envisioned: as Meta or as Media. Both modes have pros and
cons, which justifies supporting both. Currently, tensors are only transported
as meta.
#### Tensor Transport As Meta
In this mode, the tensor is attached to the buffer (the media) on which the
analysis was performed. The advantage of this mode is that the original media is
kept in direct association with the analytics results. Further refinement analysis
or consumption (like overlay) of the analytics results is easier when the media on
which the analysis was performed is available and easily identifiable. Another
advantage is the ability to keep a relation description between tensors in a
refinement context. On the other hand, this mode of transporting analytics results
makes negotiation difficult, for tensor decoders in particular.
### Inference Sinkpad(s) Capabilities
Sinkpad capabilities, before being constrained based on the model, can be any
media type.
### Inference Srcpad(s) Capabilities
Srcpad capabilities will be identical to sinkpad capabilities.
# Reference
- [Onnx-Refactor-MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4916)
- [Analytics-Meta MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4962)