diff --git a/subprojects/gst-docs/markdown/additional/design/machine-learning-analytics.md b/subprojects/gst-docs/markdown/additional/design/machine-learning-analytics.md
new file mode 100644
index 0000000000..950c4dc910
--- /dev/null
+++ b/subprojects/gst-docs/markdown/additional/design/machine-learning-analytics.md
@@ -0,0 +1,285 @@
# Machine Learning Based Analytics

Analytics refers to the process of extracting information from the content of
one or more media streams. The analysis can be spatial only (for example, image
analysis), temporal only (like sound detection), spatio-temporal (tracking or
action recognition), or multi-modal (image + sound to detect an environment or
a behaviour). There are also scenarios where the result of one analysis is used
as the input of another, with or without additional media. This design aims to
support ML-based analytics as well as CV analytics and to offer a way to bridge
both techniques.

## Vision

With this design we aim to allow GStreamer application developers to build
analytics pipelines easily while taking full advantage of the acceleration
available on the platform where they deploy. The effort of moving an analytics
pipeline to a different platform should be minimal.

## Refinement Using Analytics Pipeline

Similarly to content-agnostic media processing (e.g. scaling, colour-space
conversion, serialization, ...), this design promotes re-usability and
simplicity by allowing the composition of complex analytics pipelines from
simple, dedicated analytics elements that complement each other.

### Example
A simple, hypothetical example of an analytics pipeline.

```
+---------+   +------------+   +---------------+   +----------------+
| v4l2src |   | video-     |   | onnxinference |   | tensor-decoder |
|         |   | convert-   |   |               |   |                |
|      src-sink scale    src-sink           src-sink             src-+
|         |   | (pre-proc) |   |  (analysis)   |   |  (post-proc)   | |
+---------+   +------------+   +---------------+   +----------------+ |
  +--------------------------------------------------------------------+
  |   +---------------+   +------+
  |   | analytics-    |   | sink |
  |   | overlay       |   |      |
  +-sink            src-sink     |
      | (analysis-    |   |      |
      |  results      |   +------+
      |  consumer)    |
      +---------------+
```
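As a complement to the diagram, here is a minimal C sketch that assembles a
pipeline of this shape with `gst_parse_launch()`. The tensor decoder and
overlay element names (`yolov3tensordecoder`, `analyticsoverlay`) and the
`model-file` property are placeholders for whatever inference element, tensor
decoder and analytics-results consumer are available on the target platform;
they are not a definitive element set.

```c
/* Sketch: assembling the example analytics pipeline programmatically.
 * Several element/property names below are placeholders; substitute the
 * inference element, tensor decoder and overlay available on your platform. */
#include <gst/gst.h>

int
main (int argc, char **argv)
{
  GstElement *pipeline;
  GstBus *bus;
  GstMessage *msg;
  GError *error = NULL;

  gst_init (&argc, &argv);

  pipeline = gst_parse_launch (
      "v4l2src ! videoconvertscale ! "            /* capture + pre-proc */
      "onnxinference model-file=yolov3.onnx ! "   /* analysis (placeholder property) */
      "yolov3tensordecoder ! "                    /* post-proc (hypothetical element) */
      "analyticsoverlay ! "                       /* analytics-results consumer (hypothetical) */
      "videoconvert ! autovideosink", &error);

  if (pipeline == NULL) {
    g_printerr ("Failed to build pipeline: %s\n", error->message);
    g_clear_error (&error);
    return 1;
  }

  gst_element_set_state (pipeline, GST_STATE_PLAYING);

  /* Run until an error or EOS, then shut down. */
  bus = gst_element_get_bus (pipeline);
  msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
      GST_MESSAGE_ERROR | GST_MESSAGE_EOS);
  gst_clear_message (&msg);
  gst_object_unref (bus);

  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);
  return 0;
}
```

The same launch string can also be used directly with `gst-launch-1.0` while
prototyping.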
## Supporting Neural Network Inference

There are multiple frameworks supporting neural network inference. These can be
described more generally as computing graphs, as they are generally not limited
to NN inference applications. Existing NN inference or computing-graph
frameworks, like ONNX-Runtime, are encapsulated into a GstElement/Filter. The
inference element loads a model, describing the computing graph, specified by a
property. The model expects inputs in a specific format and produces outputs in
a specific format. Depending on the model format, input/output formats can be
extracted from the model, as with ONNX, but this is not always the case.

### Inference Element
Inference elements are an encapsulation of an NN inference framework. Therefore
they are specific to a framework, like ONNX-Runtime or TensorFlow-Lite.
Other inference elements can be added.

### Inference Input(s)
The input format is defined by the model. Using the model input format, the
inference element can constrain its sinkpad(s) capabilities. Note that because
tensors are very generic, the term also encompasses images/frames, and the term
input tensor is also used to describe inference input.

### Inference Output(s)
The output(s) of the inference are tensors and their format is also dictated by
the model. Analysis results are generally encoded in the output tensor in a way
that is specific to the model. Even models that target the same type of
analysis encode results in different ways.

### Model Formats Not Describing Input/Output Tensor Format
With some model formats, the input/output tensor format is not described. In
this context, it's the responsibility of the analytics pipeline to push input
tensors with the correct format into the inference process. Here, the inference
element designer is left with two choices: supporting a model manifest where
inputs/outputs are described, or leaving the constraining/fixating of the
inputs/outputs to the analytics pipeline designer, who can use caps filters to
constrain the inputs/outputs of the model.

### Tensor Decoders
In order to preserve the generality of the inference element, tensor decoding
is omitted from the inference element and left to specialized elements that
have the specific task of decoding tensors from a specific model. Additionally,
tensor decoding does not depend on a specific NN framework or inference
element; this allows reusing the same tensor decoder when the same model is
used with a different inference element. For example, a YOLOv3 tensor decoder
can be used to decode tensors from inference with a YOLOv3 model in an element
encapsulating ONNX or TFLite. Note that a tensor decoder can handle multiple
tensors that have similar encodings.

### Tensor
An N-dimensional vector.

#### Tensor Type Identifier
This is an identifier, string or quark, that uniquely identifies a tensor type.
The tensor type describes the specific format used to encode analysis results
in memory. This identifier is used by tensor decoders to know if they can
handle the decoding of a tensor. For this reason, from an implementation
perspective, the tensor decoder is the ideal location to store the tensor type
identifier, as its code is already model specific. Since the tensor decoder is
by design specific to a model, no generality is lost by storing the tensor type
identifier there.

#### Tensor Datatype
This is the primitive type used to store tensor data, like `int8`, `uint8`,
`float16`, `float32`, ...

#### Tensor Dimension Cardinality

The number of dimensions in the tensor.

#### Tensor Dimension

The tensor shape.

- [a], 1-dimensional vector
- [a x b], 2-dimensional vector
- [a x b x c], 3-dimensional vector
- [a x b x ... x n], N-dimensional vector

### Tensor Decoders Need to Recognize Tensor(s) They Can Handle

As mentioned before, tensor decoders need to be able to recognize the tensor(s)
they can handle. It's important to keep in mind that multiple tensors can be
attached to a buffer when tensors are transported as a meta. It would be easy
to believe that a tensor's (cardinality + dimensions + datatype) is sufficient
to recognize a specific tensor format, but we need to remember that analysis
results are encoded into the tensor, and retrieving analysis results requires a
decoding process specific to the model. In other words, a tensor
A:{cardinality: 2, dimensions: 100 x 5, datatype: int8} and a tensor
B:{cardinality: 2, dimensions: 100 x 5, datatype: int8} can have completely
different meanings.

A could be: (object detection where each candidate is encoded with (top-left)
coordinates, width, height and object location confidence level)

```
0 : [ x1, y1, w, h, location confidence]
1 : [ x1, y1, w, h, location confidence]
...
99: [ x1, y1, w, h, location confidence]
```

B could be: (object detection where each candidate is encoded with (top-left)
coordinates, (bottom-right) coordinates and object class confidence level)

```
0 : [ x1, y1, x2, y2, class confidence]
1 : [ x1, y1, x2, y2, class confidence]
...
99: [ x1, y1, x2, y2, class confidence]
```

We can see that even if A and B have the same (cardinality, dimensions,
datatype), a tensor decoder expecting A and decoding B would be wrong.

In general, for high-cardinality tensors, the risk of having two tensors with
the same (cardinality + dimensions + datatype) is low, but if we think of the
low-cardinality tensors typical of classification (1 x C), we can see that the
risk is much higher. For this reason, we believe it's not sufficient for a
tensor decoder to rely only on (cardinality + dimensions + datatype) to
identify the tensors it can handle.
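To make the tensor properties above concrete, and to show where a type
identifier fits in, here is a minimal C sketch of a descriptor a tensor decoder
could inspect before accepting a tensor. The `TensorInfo` struct, its fields
and the "yolov3-detection" identifier are purely illustrative; they do not
correspond to an existing GStreamer API.

```c
/* Illustrative only: a conceptual tensor descriptor grouping the properties
 * described above. This is NOT an existing GStreamer type. */
#include <glib.h>

typedef enum {
  TENSOR_DATATYPE_INT8,
  TENSOR_DATATYPE_UINT8,
  TENSOR_DATATYPE_FLOAT16,
  TENSOR_DATATYPE_FLOAT32,
} TensorDatatype;

typedef struct {
  GQuark type_id;          /* tensor type identifier, e.g. a quark for a
                            * hypothetical "yolov3-detection" encoding */
  TensorDatatype datatype; /* primitive type of each element */
  guint cardinality;       /* number of dimensions */
  gsize dims[8];           /* tensor shape, dims[0] .. dims[cardinality - 1] */
} TensorInfo;

/* A tensor decoder checks the type identifier first: as argued above, shape
 * and datatype alone are not enough to know how results are encoded. */
static gboolean
tensor_decoder_accepts (const TensorInfo * info)
{
  return info->type_id == g_quark_from_static_string ("yolov3-detection");
}
```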
#### A Tensor Decoder's Second Job: Non-Maximum Suppression (NMS)

The main functionality of tensor decoders is to extract analytics results from
tensors, but in addition to decoding tensors, a second phase of post-processing
is generally handled by the tensor decoder. This post-processing phase is
called non-maximum suppression (NMS). The simplest example of NMS is with
classification. For every input, the classification model will produce a
probability for each potential class. In general, we're mostly interested in
the most probable class or the few most probable classes, and there's little
value in transporting the probability of every class. In addition to keeping
only the most probable class (or classes), we often want the probability to be
above a certain threshold, otherwise we're not interested in the result.
Because a significant portion of the analytics results from the inference
process don't have much value, we want to filter them out as early as possible.
Since analytics results are only available after tensor decoding, the tensor
decoder is tasked with this type of filtering (NMS). The same concept exists
for object detection, where NMS generally involves calculating the
intersection-over-union (IoU) in combination with location and class
probability. Because ML-based analytics are probabilistic by nature, they
generally need a form of NMS post-processing.

#### Handling Multiple Tensors Simultaneously In A Tensor Decoder
Sometimes it is necessary, or more efficient, to have a tensor decoder handle
multiple tensors simultaneously. In some cases, the tensors are complementary
and a tensor decoder needs both tensors to decode the analytics results. In
other cases, it's just more efficient to do it simultaneously because of the
tensor decoder's second job, NMS. Let's consider YOLOv3, where 3 output tensors
are produced for each input: one tensor represents the detection of small
objects, a second tensor medium-size objects and a third tensor large objects.
In this context, it's beneficial to have the tensor decoder decode the 3
tensors simultaneously and perform NMS on all the results, otherwise analytics
results with low value would remain in the system for longer. This has
implications for the negotiation of tensor decoders, which will be expanded on
in the section dedicated to tensor decoder negotiation.
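Below is a minimal sketch of the IoU-based filtering step described above, as a
tensor decoder might apply it to already-decoded object-detection candidates.
The `Candidate` type, the thresholds and the greedy keep-the-most-confident
strategy are assumptions made for this example, not a prescribed
implementation.

```c
/* Illustrative NMS sketch: greedily keep the most confident candidates and
 * drop any remaining candidate that overlaps a kept one too much. */
#include <glib.h>

typedef struct {
  gfloat x1, y1, x2, y2;   /* box corners */
  gfloat confidence;       /* location/class confidence */
} Candidate;

static gfloat
iou (const Candidate * a, const Candidate * b)
{
  gfloat ix1 = MAX (a->x1, b->x1), iy1 = MAX (a->y1, b->y1);
  gfloat ix2 = MIN (a->x2, b->x2), iy2 = MIN (a->y2, b->y2);
  gfloat iw = MAX (0.0f, ix2 - ix1), ih = MAX (0.0f, iy2 - iy1);
  gfloat inter = iw * ih;
  gfloat area_a = (a->x2 - a->x1) * (a->y2 - a->y1);
  gfloat area_b = (b->x2 - b->x1) * (b->y2 - b->y1);
  return inter / (area_a + area_b - inter);
}

static gint
by_confidence_desc (gconstpointer pa, gconstpointer pb)
{
  const Candidate *a = pa, *b = pb;
  return (a->confidence < b->confidence) - (a->confidence > b->confidence);
}

/* Returns a new GArray holding only the candidates that survive NMS;
 * 'candidates' is sorted in place by descending confidence. */
static GArray *
nms (GArray * candidates, gfloat conf_threshold, gfloat iou_threshold)
{
  GArray *kept = g_array_new (FALSE, FALSE, sizeof (Candidate));
  guint i, j;

  g_array_sort (candidates, by_confidence_desc);

  for (i = 0; i < candidates->len; i++) {
    Candidate *c = &g_array_index (candidates, Candidate, i);
    gboolean keep = c->confidence >= conf_threshold;

    for (j = 0; keep && j < kept->len; j++) {
      if (iou (c, &g_array_index (kept, Candidate, j)) > iou_threshold)
        keep = FALSE;
    }
    if (keep)
      g_array_append_val (kept, *c);
  }
  return kept;
}
```

For YOLOv3 as described above, the candidates decoded from all three output
tensors would be pooled into a single list before this filtering step.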
### Why Interpret (Decode) Tensors
As described above, tensors contain information and are used to store analytics
results. The analytics results are encoded in a model-specific way into the
tensor and, unless their consumers (the processes making use of the analytics
results) are also model specific, they need to be decoded. Deciding whether the
analytics pipeline will have elements producing and consuming tensors directly
in their encoded form, or whether a tensor-decoding process will take place
between tensor production and consumption, is a design decision that involves a
compromise between re-usability and performance. As an example, an object
detection overlay element would need to be model specific to consume tensors
directly. Therefore, it would need to be re-written for any object-detection
model using a different encoding scheme, but if the only goal of the analytics
pipeline is to do this overlay, it would probably be the most efficient
implementation. Another aspect in favour of interpreting tensors is that there
can be multiple consumers of the analytics results, and if the tensor decoding
is left to the consumers themselves, the same tensor is decoded multiple times.
However, we can think of two models specifically designed to work together,
where the output of one model becomes the input of the downstream model. In
this context the downstream model is not re-usable without the upstream model,
but they bypass the need for tensor decoding and are very efficient. Another
variation is that multiple models are merged into one model, removing the need
for multi-level inference, but again, this is a design decision involving a
compromise between re-usability, performance and effort. We aim to provide
support for all these use cases, and to allow the analytics pipeline designer
to make the best design decisions based on their specific context.

#### Analytics Meta
The analytics meta (GstAnalyticsRelationMeta) is the foundation of the
re-usability of analytics results. Its goal is to store analytics results
(GstAnalyticsMtd) in an efficient way and to allow relations to be defined
between them. GstAnalyticsMtd is very primitive and is meant to be expanded.
GstAnalyticsMtdClassification (storage for classification results),
GstAnalyticsMtdObjectDetection (storage for object-detection results) and
GstAnalyticsMtdTracking (storage for object tracking) are specializations and
can be used as references to create other storage, based on GstAnalyticsMtd,
for other types of analytics results.

There are two major use cases for the ability to define relations between
analytics results. The first one is to define a relation between analytics
results that were generated at different stages. A good example of this is a
first analysis that detects cars in an image, followed by a second-level
analysis where only the section of the image presenting a car is pushed to a
second analysis to extract the brand/model of the car in that section of the
image. This analytics result is then appended to the original image with a
relation defined to the object-detection result that localized the car in the
image.

The other use case for relations is to create a composition by re-using
existing GstAnalyticsMtd specializations. The relation between different
analytics results is completely decoupled from the analytics results
themselves.

All relation definitions are stored in GstAnalyticsRelationMeta, which is a
container of GstAnalyticsMtd and also contains an adjacency matrix storing the
relations. One of the benefits is the ability of a consumer of analytics meta
to explore the graph and follow relations between analytics results without
having to understand every type of result in the relation path. Another
important aspect is that analytics metas are not specific to machine learning
techniques and can also be used to store analysis results from computer vision,
heuristics or other techniques. They can be used as a bridge between different
techniques.
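As a rough illustration of how a tensor decoder and a downstream refinement
stage could use this meta, here is a sketch based on the GstAnalyticsRelationMeta
API proposed in the Analytics-Meta MR referenced below. The exact function
names and signatures should be checked against the merged API, and the "car" /
"suv" labels and coordinates are arbitrary example values.

```c
/* Sketch (assumed API from the Analytics-Meta MR): attach an object-detection
 * result and a classification result to a buffer and relate them. */
#include <gst/gst.h>
#include <gst/analytics/analytics.h>

static void
attach_results (GstBuffer * buffer)
{
  GstAnalyticsRelationMeta *rmeta;
  GstAnalyticsODMtd od_mtd;
  GstAnalyticsClsMtd cls_mtd;

  rmeta = gst_buffer_add_analytics_relation_meta (buffer);

  /* Object-detection result: a "car" located at (x, y, w, h) with 80%
   * location confidence. */
  gst_analytics_relation_meta_add_od_mtd (rmeta,
      g_quark_from_static_string ("car"), 100, 50, 200, 120, 0.8f, &od_mtd);

  /* Second-stage classification result refining the detected car. */
  gst_analytics_relation_meta_add_one_cls_mtd (rmeta, 0.9f,
      g_quark_from_static_string ("suv"), &cls_mtd);

  /* Record that the classification relates to the detection, so a consumer
   * can follow the relation without knowing how either result was produced. */
  gst_analytics_relation_meta_set_relation (rmeta,
      GST_ANALYTICS_REL_TYPE_RELATE_TO,
      gst_analytics_mtd_get_id ((GstAnalyticsMtd *) & cls_mtd),
      gst_analytics_mtd_get_id ((GstAnalyticsMtd *) & od_mtd));
}
```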
### Tensor Transport Mode
Two transport modes are envisioned: as a meta or as a media. Both modes have
pros and cons, which justifies supporting both. Currently, tensors are only
transported as a meta.

#### Tensor Transport As Meta
In this mode the tensor is attached to the buffer (the media) on which the
analysis was performed. The advantage of this mode is that the original media
is kept in direct association with the analytics results. Further refinement
analysis or consumption (like overlay) of the analytics results is easier when
the media on which the analysis was performed is available and easily
identifiable. Another advantage is the ability to keep a relation description
between tensors in a refinement context. On the other hand, this mode of
transporting analytics results makes the negotiation of tensor decoders in
particular difficult.

### Inference Sinkpad(s) Capabilities
The sinkpad capabilities, before being constrained based on the model, can be
any media type.

### Inference Srcpad(s) Capabilities

The srcpad capabilities will be identical to the sinkpad capabilities.

## References
- [Onnx-Refactor-MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4916)
- [Analytics-Meta MR](https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/4962)