doc: Add analytics support design

Part-of: <https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/6139>
# Machine Learning Based Analytics
Analytics refers to the process of extracting information from the content of one or
more media streams. The analysis can be spatial only, for example image analysis;
temporal only, like sound detection; spatio-temporal, such as tracking or action
recognition; or multi-modal, like combining image and sound to detect an environment
or behaviour. There are also scenarios where the result of one analysis is used as the
input of another, with or without an additional media. This design aims to support
both ML-based analytics and CV analytics, and to offer a way to bridge both techniques.
## Vision
With this design we aim to allow GStreamer application developers to develop
analytics pipelines easily, while taking full advantage of the acceleration
available on the platform where they deploy. The effort of moving an analytics
pipeline to a different platform should be minimal.
## Refinement Using Analytics Pipeline
Similarly to content-agnostic media processing (e.g. scaling, color-space conversion,
serialization, ...), this design promotes re-usability and simplicity by allowing
the composition of complex analytics pipelines from simple, dedicated analytics
elements that complement each other.
### Example
A simple, hypothetical example of an analytics pipeline:
```
+---------+  +------------+  +---------------+  +----------------+
| v4l2src |  | video      |  | onnxinference |  | tensor-decoder |
|         |  | convert    |  |               |  |                |
|      src---sink scale src--sink1        src1--sink           src---+
|         |  | (pre-proc) |  |  (analysis)   |  |  (post-proc)   |   |
+---------+  +------------+  +---------------+  +----------------+   |
   +-----------------------------------------------------------------+
   |
   |  +-------------+  +------+
   |  | Analytics-  |  | sink |
   |  | overlay     |  |      |
   +--sink        src--sink   |
      | (analysis-  |  |      |
      |  results-   |  +------+
      |  consumer)  |
      +-------------+
```
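As a rough illustration only, the pipeline above could be assembled programmatically
along the following lines. This is a sketch, not a reference implementation:
`tensordecoder` and `analyticsoverlay` are placeholder element names standing in for a
model-specific tensor decoder and an analytics-results consumer, and the `model-file`
property name on `onnxinference` is an assumption to be checked against the element's
actual properties.

```c
#include <gst/gst.h>

int
main (int argc, char *argv[])
{
  GstElement *pipeline;
  GstBus *bus;
  GstMessage *msg;
  GError *error = NULL;

  gst_init (&argc, &argv);

  /* "tensordecoder" and "analyticsoverlay" are placeholder names for a
   * model-specific tensor decoder and an analytics-results consumer;
   * the "model-file" property name is also an assumption. */
  pipeline = gst_parse_launch ("v4l2src ! videoconvertscale ! "
      "onnxinference model-file=model.onnx ! "
      "tensordecoder ! analyticsoverlay ! autovideosink", &error);
  if (pipeline == NULL) {
    g_printerr ("Could not build pipeline: %s\n", error->message);
    g_clear_error (&error);
    return -1;
  }

  gst_element_set_state (pipeline, GST_STATE_PLAYING);

  /* Run until EOS or an error is posted on the bus. */
  bus = gst_element_get_bus (pipeline);
  msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
      GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
  if (msg != NULL)
    gst_message_unref (msg);
  gst_object_unref (bus);

  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);
  return 0;
}
```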
## Supporting Neural Network Inference
There are multiple frameworks supporting neural network inference. They can be
described more generally as computing-graph frameworks, as they are generally not
limited to NN inference applications. Existing NN inference or computing-graph
frameworks, like ONNX-Runtime, are encapsulated into a GstElement/Filter. The
inference element loads a model, specified by a property, that describes the
computing graph. The model expects inputs in a specific format and produces outputs
in a specific format. Depending on the model format, input/output formats can be
extracted from the model, as with ONNX, but this is not always the case.
### Inference Element
Inference elements are encapsulations of an NN inference framework and are therefore
specific to a framework, like ONNX-Runtime or TensorFlow-Lite.
Other inference elements can be added.
### Inference Input(s)
The input format is defined by the model. Using the model's input format, the
inference element can constrain its sinkpad(s) capabilities. Note that because
tensors are very generic, the term also covers images/frames, and the term
"input tensor" is also used to describe an inference input.
### Inference Output(s)
Output(s) of the inference are tensors and their format is also dictated by the
model. Analysis results are generally encoded in the output tensors in a way that
is specific to the model; even models that target the same type of analysis
encode results in different ways.
### Models Format Not Describing Inputs/Outputs Tensor Format
With some models, the input/output tensor formats are not described. In
this case, it is the responsibility of the analytics pipeline to push input
tensors with the correct format into the inference process. The inference element
designer is then left with two choices: supporting a model manifest
where inputs/outputs are described, or leaving the constraining/fixing of the
inputs/outputs to the analytics pipeline designer, who can use caps filters to
constrain the inputs/outputs of the model.
### Tensor Decoders
In order to preserve the generality of the inference element, tensor decoding is
omitted from the inference element and left to specialized elements whose
specific task is to decode tensors from a specific model. Additionally,
tensor decoding does not depend on a specific NN framework or inference element;
this allows reusing the same tensor decoder when the same model is used with a
different inference element. For example, a YOLOv3 tensor decoder can be used to
decode tensors produced by a YOLOv3 model whether the inference element encapsulates
ONNX or TFLite. Note that a tensor decoder can handle multiple tensors that have
similar encodings.
### Tensor
N-dimensional vector.
#### Tensor Type Identifier
This is an identifier, string or quark, that uniquely identifies a tensor type. The
tensor type describes the specific format used to encode analysis results in
memory. This identifier is used by tensor decoders to know if they can handle
the decoding of a tensor. For this reason, from an implementation perspective,
the tensor decoder is the ideal location to store the tensor type identifier, as its
code is already model specific. Since the tensor decoder is by design specific to a
model, no generality is lost by storing the tensor type identifier there.
#### Tensor Datatype
This is the primitive type used to store tensor data, like `int8`,
`uint8`, `float16`, `float32`, ...
#### Tensor Dimension Cardinality
Number of dimensions in the tensor.
#### Tensor Dimension
Tensor shape.
- [a], 1-dimensional vector
- [a x b], 2-dimensional vector
- [a x b x c], 3-dimensional vector
- [a x b x ... x n], N-dimensional vector
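Putting the attributes above together (type identifier, data type, cardinality and
dimensions), a tensor description could be grouped into a descriptor along the
following lines. The structure and field names are purely illustrative, not an
existing GStreamer API:

```c
#include <glib.h>

/* Hypothetical primitive data types a tensor could use. */
typedef enum
{
  EXAMPLE_TENSOR_TYPE_INT8,
  EXAMPLE_TENSOR_TYPE_UINT8,
  EXAMPLE_TENSOR_TYPE_FLOAT16,
  EXAMPLE_TENSOR_TYPE_FLOAT32,
} ExampleTensorDataType;

/* Hypothetical tensor descriptor grouping the attributes described above. */
typedef struct
{
  GQuark type_id;               /* tensor type identifier (encoding scheme)   */
  ExampleTensorDataType dtype;  /* primitive data type of each element        */
  guint num_dims;               /* cardinality: number of dimensions          */
  gsize *dims;                  /* shape: dims[0] x dims[1] x ... x dims[n-1] */
  guint8 *data;                 /* raw tensor data                            */
} ExampleTensor;
```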
### Tensor Decoders Need to Recognize Tensor(s) They Can Handle
As mentioned before, tensor decoders need to be able to recognize the tensor(s) they
can handle. It is important to keep in mind that multiple tensors can be attached to
a buffer when tensors are transported as a meta. It would be easy to
believe that a tensor's (cardinality + dimensions + data type) is sufficient to
recognize a specific tensor format, but we need to remember that analysis results
are encoded into the tensor, and retrieving analysis results requires a decoding
process specific to the model. In other words, a tensor A:{cardinality: 2,
dimensions: 100 x 5, datatype: int8} and a tensor B:{cardinality: 2, dimensions: 100 x 5,
datatype: int8} can have completely different meanings.
A could be object detection where each candidate is encoded with (top-left)
coordinates, width, height and an object location confidence level:
```
0 : [ x1, y1, w, h, location confidence]
1 : [ x1, y1, w, h, location confidence]
...
99: [ x1, y1, w, h, location confidence]
```
B could be object detection where each candidate is encoded with (top-left)
coordinates, (bottom-right) coordinates and a class confidence level:
```
0 : [ x1, y1, x2, y2, class confidence]
1 : [ x1, y1, x2, y2, class confidence]
...
99: [ x1, y1, x2, y2, class confidence]
```
We can see that even if A and B have the same (cardinality, dimensions, data type), a
tensor decoder expecting A and decoding B would be wrong.
In general, for high cardinality tensors, the risk of having two tensors with the same
(cardinality + dimensions + data type) is low, but if we think of low cardinality
tensors typical of classification (1 x C), we can see that the risk is much
higher. For this reason, we believe it is not sufficient for a tensor decoder to
rely only on (cardinality + dimensions + data type) to identify the tensors it can
handle.
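Building on the hypothetical `ExampleTensor` descriptor sketched in the Tensor section
above, a tensor decoder could rely on the tensor type identifier, rather than on shape
and data type alone, to decide whether it can decode a given tensor. All names here
are illustrative, including the `hypothetical/od-xywh` identifier:

```c
/* Decoder for interpretation A: rows of [x1, y1, w, h, location confidence].
 * ExampleTensor and EXAMPLE_TENSOR_TYPE_INT8 come from the hypothetical
 * descriptor sketched earlier. Only tensors carrying the expected type
 * identifier are accepted; shape and data type alone cannot distinguish
 * interpretation A from interpretation B. */
static gboolean
example_decode_tensor_a (const ExampleTensor * tensor)
{
  if (tensor->type_id != g_quark_from_static_string ("hypothetical/od-xywh"))
    return FALSE;               /* not a tensor this decoder understands */

  if (tensor->num_dims != 2 || tensor->dims[1] != 5 ||
      tensor->dtype != EXAMPLE_TENSOR_TYPE_INT8)
    return FALSE;               /* sanity check on the expected shape */

  for (gsize i = 0; i < tensor->dims[0]; i++) {
    const gint8 *row = (const gint8 *) tensor->data + i * 5;
    /* row[0..3] = x1, y1, w, h; row[4] = location confidence.
     * ... convert each candidate into an analytics result here ... */
  }
  return TRUE;
}
```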
#### A Tensor Decoder's Second Job: Non-Maximum Suppression (NMS)
The main functionality of tensor decoders is to extract analytics results from tensors,
but in addition to decoding tensors, a second phase of post-processing is generally
handled by the tensor decoder. This post-processing phase is called non-maximum
suppression (NMS). The simplest example of NMS is classification: for every
input, the classification model will produce a probability for each potential class.
In general, we are mostly interested in the most probable class, or the few most
probable classes, and there is little value in transporting all class
probabilities. In addition to keeping only the most probable class (or classes), we
often want the probability to be above a certain threshold, otherwise we are
not interested in the result. Because a significant portion of the analytics results
from the inference process do not have much value, we want to filter them out
as early as possible. Since analytics results are only available after tensor
decoding, the tensor decoder is tasked with this type of filtering (NMS). The same
concept exists for object detection, where NMS generally involves calculating
the intersection-over-union (IoU) of candidate boxes in combination with location and
class probabilities. Because ML-based analytics are probabilistic by nature, they
generally need a form of NMS post-processing.
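As a reference point for the object-detection case, IoU is the ratio of the area
shared by two boxes to the area covered by their union. Below is a minimal sketch,
with boxes given as a top-left corner plus width and height; the type and function
names are illustrative only:

```c
#include <glib.h>

typedef struct
{
  gdouble x, y;                 /* top-left corner   */
  gdouble w, h;                 /* width and height  */
} ExampleBox;

/* Intersection-over-union of two axis-aligned boxes, in [0, 1]. */
static gdouble
example_box_iou (const ExampleBox * a, const ExampleBox * b)
{
  gdouble ix = MAX (a->x, b->x);
  gdouble iy = MAX (a->y, b->y);
  gdouble iw = MIN (a->x + a->w, b->x + b->w) - ix;
  gdouble ih = MIN (a->y + a->h, b->y + b->h) - iy;
  gdouble inter, uni;

  if (iw <= 0.0 || ih <= 0.0)
    return 0.0;                 /* no overlap */

  inter = iw * ih;
  uni = a->w * a->h + b->w * b->h - inter;
  return inter / uni;
}
```

During NMS, candidates whose IoU with a higher-confidence candidate exceeds a chosen
threshold (for example 0.5) are typically discarded.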
#### Handling Multiple Tensors Simultaneously In A Tensor Decoder
Sometimes it is necessary, or more efficient, to have a tensor decoder handle
multiple tensors simultaneously. In some cases the tensors are complementary, and a
tensor decoder needs all of them to decode the analytics results. In other
cases it is just more efficient to handle them simultaneously because of the
tensor decoder's second job, NMS. Consider YOLOv3, where 3 output tensors are
produced for each input: one tensor represents the detection of small objects, a second
tensor medium-size objects and a third tensor large-size objects. In this context,
it is beneficial to have the tensor decoder decode the 3 tensors simultaneously and
perform the NMS on all the results; otherwise, analytics results with low value
would remain in the system for longer. This has implications for the negotiation
of tensor decoders, which will be expanded on in the section dedicated to tensor decoder
negotiation.
### Why Interpret (Decode) Tensors
As described above, tensors contain information and are used to store analytics
results. The analytics results are encoded into the tensor in a model-specific way,
and unless their consumers (processes making use of analytics results) are
also model specific, they need to be decoded. Deciding whether the analytics pipeline
will have elements producing and consuming tensors directly in their encoded
form, or whether a tensor-decoding process will be done between tensor production and
consumption, is a design decision that involves a compromise between re-usability
and performance. As an example, an object detection overlay element would need to
be model specific to consume tensors directly. Therefore, it would need to be
re-written for any object-detection model using a different encoding scheme, but
if the only goal of the analytics pipeline is to do this overlay, it would
probably be the most efficient implementation. Another aspect in favour of
interpreting tensors is that we can have multiple consumers of the analytics
results, and if the tensor decoding is left to the consumers themselves, it implies
decoding the same tensor multiple times. However, we can think of two models
specifically designed to work together where the output of one model becomes the
input of the downstream model. In this context the downstream model is not
re-usable without the upstream model, but together they bypass the need for
tensor decoding and are very efficient. Another variation is that multiple
models are merged into one model, removing the need for multi-level inference,
but again, this is a design decision involving a compromise on re-usability,
performance and effort. We aim to support all these use cases,
and to allow the analytics pipeline designer to make the best design decisions based
on their specific context.
#### Analytics Meta
The Analytics Meta (GstAnalyticsRelationMeta) is the foundation of re-usability of
analytics results. Its goal is to store analytics results (GstAnalyticsMtd)
in an efficient way and to allow relations to be defined between them. GstAnalyticsMtd
is very primitive and is meant to be expanded. GstAnalyticsMtdClassification (storage
for classification results), GstAnalyticsMtdObjectDetection (storage for
object detection results) and GstAnalyticsMtdTracking (storage for
object tracking) are specializations and can be used as references to create other
storage, based on GstAnalyticsMtd, for other types of analytics results.
There are two major use cases for the ability to define relations between
analytics results. The first one is to define a relation between analytics results
that were generated at different stages. A good example of this is a first
analysis that detects cars in an image, followed by a second-level analysis where
only the section of the image presenting a car is pushed to a second analysis to
extract the brand/model of the car. This analytics result is then
appended to the original image with a relation to the object-detection
result that localized the car in the image.
The other use case for relations is to create compositions by re-using existing
GstAnalyticsMtd specializations. The relations between different analytics results are
completely decoupled from the analytics results themselves.
All relation definitions are stored in
GstAnalyticsRelationMeta, which is a container of GstAnalyticsMtd and also contains
an adjacency matrix storing the relations. One of the benefits is the ability of a
consumer of analytics meta to explore the graph and follow relations between
analytics results without having to understand every type of result in the
relation path. Another important aspect is that the analytics meta is not
specific to machine learning techniques and can also be used to store analysis
results from computer vision, heuristics or other techniques. It can be used as
a bridge between different techniques.
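As a rough sketch of how this could look from a tensor decoder, the snippet below
attaches an object-detection result and a related classification result to a buffer.
The function and type names (e.g. `gst_buffer_add_analytics_relation_meta()`,
`gst_analytics_relation_meta_add_od_mtd()`, `GstAnalyticsODMtd`) are assumed from the
GstAnalytics library as it existed when this was written and may differ slightly from
the conceptual names used above; check the library headers for the exact API.

```c
#include <gst/analytics/analytics.h>

/* Attach a "car" detection and a related brand classification to a buffer.
 * API names and signatures are assumptions based on the GstAnalytics library;
 * verify them against the installed headers. */
static void
example_attach_results (GstBuffer * buf)
{
  GstAnalyticsRelationMeta *rmeta;
  GstAnalyticsODMtd od_mtd;
  GstAnalyticsClsMtd cls_mtd;
  GQuark brand = g_quark_from_static_string ("some-brand");

  rmeta = gst_buffer_add_analytics_relation_meta (buf);

  /* Object-detection result: location of the car in the image. */
  gst_analytics_relation_meta_add_od_mtd (rmeta,
      g_quark_from_static_string ("car"), 120, 60, 320, 240, 0.92f, &od_mtd);

  /* Second-level analysis result: the brand of that car. */
  gst_analytics_relation_meta_add_one_cls_mtd (rmeta, 0.87f, brand, &cls_mtd);

  /* Relate the classification to the detection it refines. */
  gst_analytics_relation_meta_set_relation (rmeta,
      GST_ANALYTICS_REL_TYPE_RELATE_TO, od_mtd.id, cls_mtd.id);
}
```

A downstream consumer can then iterate the relation meta, follow the relation from the
object-detection result to the classification result, and act on both without knowing
which model produced them.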
### Tensor Transport Mode
Two transport modes are envisioned: as a meta or as a media. Both modes have pros and
cons, which justifies supporting both. Currently, tensors are only transported
as a meta.
#### Tensor Transport As Meta
In this mode, the tensor is attached to the buffer (the media) on which the analysis
was performed. The advantage of this mode is that the original media is kept in
direct association with the analytics results. Further refinement analysis, or
consumption (like overlay) of the analytics results, is easier when the media on
which the analysis was performed is available and easily identifiable. Another
advantage is the ability to keep a relation description between tensors in a
refinement context. On the other hand, this mode of transporting analytics results
makes negotiation, of tensor decoders in particular, difficult.
### Inference Sinkpad(s) Capabilities
Sinkpad capabilities, before being constrained based on the model, can be any
media type.
### Inference Srcpad(s) Capabilities
Srcpad capabilities will be identical to sinkpad capabilities.
# References
- [Onnx-Refactor-MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4916)
- [Analytics-Meta MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4962)