Human-Computer Interaction Technology in Virtual Reality - Gesture Interaction Technology Based on Vision - Dr. Fulai Peng

2021-08-03

1. Introduction

Interaction is one of the three defining characteristics of virtual reality. Human-computer interaction in virtual reality refers to users interacting with computer-generated objects in the virtual world through interactive devices in a convenient and natural way. Two-way perception between the user and the virtual environment establishes a more natural and harmonious human-computer environment, and interaction is the core link through which virtual reality delivers the user experience and moves toward practical application. This paper introduces an emerging virtual reality interaction technology: vision-based gesture interaction.

Figure 1. Gesture interaction technology based on vision


2. Gesture interaction technology based on vision


Gesture is the most important form of non-verbal communication between people and one of the main ways users interact with a VR virtual environment. The accuracy and speed of gesture recognition directly affect the accuracy, fluency and naturalness of human-computer interaction. With vision-based gesture interaction, users do not need to wear any device; it offers convenient interaction and natural, rich expression, follows the general trend toward natural human-computer interaction, and has a wide range of applications. As an important part of human-computer interaction, vision-based gesture interaction is of great significance for achieving natural interaction between humans and the VR virtual environment and has broad application prospects.


Vision-based gesture interaction uses gesture recognition to realize human-computer interaction. The interaction process consists of four main steps, as shown in Figure 2: 1) Data collection: capture images of the human hand with a camera; 2) Hand detection and segmentation: detect whether a hand is present in the input image and, if so, locate it and segment the hand region; 3) Gesture recognition: extract features from the hand region and classify the gesture type with a chosen method; 4) Control of the virtual environment: send the recognition result to the virtual environment control system so that virtual people or objects perform the corresponding actions. Among these steps, gesture recognition is the core of the whole gesture interaction process, and hand detection and segmentation is its foundation. A minimal code sketch of this loop is given after Figure 2 below.

Figure 2. The process of visual human-computer interaction
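
As a rough illustration only, the Python sketch below mirrors the four-step loop in Figure 2. The four functions passed in (detect_and_segment_hand, extract_features, classify_gesture, send_to_vr_controller) are hypothetical placeholders for the methods discussed in Sections 2.1 and 2.2 and for the virtual environment control interface.

```python
# Minimal sketch of the four-step gesture interaction loop in Figure 2.
# The injected callables are hypothetical placeholders, not a fixed API.
import cv2

def gesture_interaction_loop(detect_and_segment_hand, extract_features,
                             classify_gesture, send_to_vr_controller,
                             camera_index=0):
    cap = cv2.VideoCapture(camera_index)            # 1) Data collection via a camera
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hand = detect_and_segment_hand(frame)    # 2) Hand detection and segmentation
            if hand is None:
                continue                             #    no hand found in this frame
            features = extract_features(hand)        # 3) Gesture recognition: features ...
            label = classify_gesture(features)       #    ... then classification
            send_to_vr_controller(label)             # 4) Drive the virtual person/object
    finally:
        cap.release()
```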


Gesture recognition is the key technology in gesture interaction: it directly determines the quality of the interaction and plays a central role in the whole process. The following sections introduce hand detection and segmentation and then gesture recognition.


2.1 Hand detection and segmentation


Hand detection and segmentation is the first step of gesture recognition and its foundation. Hand detection determines whether a hand is present in the image data and locates its position in the image; hand segmentation separates the hand region from the rest of the image, which simplifies subsequent operations and reduces the amount of computation. Objects are generally characterized by three kinds of features: edge shape, texture, and color. At a typical working distance the texture of the hand is relatively smooth and low in contrast, so texture features offer little advantage for hand detection; shape and color features are used instead. Common hand detection methods can therefore be roughly divided into the following categories: methods based on shape information, methods based on skin color information, and methods based on motion information.


2.1.1 Method based on shape information feature


Shape is an important characteristic for describing image content. Because the shape of the hand differs to some extent from the shapes of other objects, these shape differences can be used to extract the hand from the image. Alternatively, a classifier can be trained on an image training set using shape information to detect the hand; this is a classification-based object detection approach, which usually assumes that the appearance difference between different gestures is far greater than the difference between different people making the same gesture. Methods of this kind often use features such as the histogram of oriented gradients (HOG), Haar wavelets, and the scale-invariant feature transform (SIFT).
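
As a small illustration, the sketch below extracts HOG features from a candidate image window with scikit-image; the window size and HOG parameters are assumptions, and the resulting vectors would be fed to a trained hand/non-hand classifier (e.g. the SVM sketched in Section 2.2.2).

```python
# Sketch: HOG feature extraction for a candidate hand window (parameters are illustrative).
import cv2
from skimage.feature import hog

def hog_features(window_bgr, size=(64, 64)):
    gray = cv2.cvtColor(window_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size)
    # 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks: common HOG settings
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
```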


2.1.2 Methods based on skin color information


Human skin color generally differs from the background, and skin color is naturally invariant to translation and rotation and largely unaffected by viewpoint or body posture. Methods based on skin color information therefore require little computation and run quickly, making them a commonly used approach to hand detection, although they are easily affected by ethnicity, illumination, background color, and similar factors. A suitable color space (RGB, HSV, YCbCr, YUV, etc.) must be chosen for skin-color-based hand detection; to improve robustness under varying lighting, color spaces that separate the luminance and chrominance components (such as HSV and YCbCr) are usually preferred.
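
A minimal sketch of skin-color segmentation in the YCbCr space with OpenCV is given below; the threshold values are commonly cited approximations rather than universal constants and would need tuning for particular lighting conditions and skin tones.

```python
# Sketch: skin-colour mask in YCbCr (OpenCV orders the channels Y, Cr, Cb).
import cv2
import numpy as np

def skin_mask_ycbcr(frame_bgr):
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)     # approximate skin range (Y, Cr, Cb)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # morphological opening removes small speckle noise in the mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    return mask
```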


2.1.3 Method based on motion information


Motion information can also be used for hand detection, but this usually requires assumptions about the gesturing person or the background, for example that the person does not move too fast, that the person is stationary (or moves very little) relative to the background, and that the lighting in the scene changes little. When the image acquisition device is fixed and the background is still or changes very little, this is called static-background detection. In this case there are three main detection methods: the optical flow method, the inter-frame difference method, and the background difference method.

The optical flow method obtains comprehensive scene information: it captures not only the gesture but also other information such as the scene itself, and it can detect a moving target independently even when no other relevant information is available in the image, giving it good independence and a wide range of applications. However, optical flow is computationally complex, and it is difficult to meet real-time requirements without acceleration techniques. The inter-frame difference method is simple and fast, can eliminate the influence of external factors to some extent, and is quite stable; however, its detection accuracy for moving targets is low, the extracted target boundaries are incomplete, and it places strict requirements on the interval between adjacent frames. The background difference method is simple and fast and can detect the moving target completely, but it applies only to a static background with a fixed camera, has a high false detection rate, and the detected motion region usually includes areas beyond the hand (such as the arm). Motion information can be used not only to detect the hand on its own but also in combination with other visual cues.
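
As an illustration, the sketch below implements the inter-frame difference method for a static camera; the threshold value is an assumption and scene-dependent. The background difference method could be sketched similarly with a maintained background model, e.g. OpenCV's createBackgroundSubtractorMOG2.

```python
# Sketch: inter-frame difference under a fixed camera (threshold is scene-dependent).
import cv2

def frame_difference_mask(prev_bgr, curr_bgr, thresh=25):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)              # pixel-wise change between frames
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask
```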


2.2 Gesture Recognition


Gesture recognition is a key technology of gesture interaction. It is the process of extracting features from the segmented hand region and classifying the gesture. It can also be understood as classifying points (or trajectories) in a model parameter space into subsets of that space: a static gesture (an image-based gesture) corresponds to a point in the parameter space, while a dynamic gesture (a video-based gesture) corresponds to a trajectory. Gesture recognition methods can be divided into the following categories: template matching methods, machine learning methods, and hidden Markov model methods.


2.2.1 Template-based matching


Template matching is one of the earliest and simplest pattern recognition methods and is mostly used for static gesture recognition. The input image is matched against templates (points, curves, or shapes) and classified according to the similarity of the match. Common similarity measures include the Euclidean distance, the Hausdorff distance, and the cosine of the angle between feature vectors. Contour edge matching and elastic graph matching also belong to this family. Template matching is simple and fast and relatively insensitive to lighting, background, and posture, so it is widely applicable; however, its classification accuracy is limited and it can recognize only a limited set of gesture types.
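
A minimal sketch of nearest-template classification with the Euclidean distance is shown below; templates is a hypothetical dictionary mapping gesture labels to grayscale reference images of the segmented hand.

```python
# Sketch: nearest-template gesture classification using Euclidean distance.
import cv2
import numpy as np

def match_gesture(hand_gray, templates, size=(64, 64)):
    probe = cv2.resize(hand_gray, size).astype(np.float32).ravel()
    best_label, best_dist = None, np.inf
    for label, template in templates.items():
        ref = cv2.resize(template, size).astype(np.float32).ravel()
        dist = np.linalg.norm(probe - ref)   # Euclidean distance as the matching score
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist
```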


2.2.2 Machine learning-based methods


Machine learning uses statistical methods to solve problems involving uncertainty. It focuses on the algorithms, known as learning algorithms, that computers use to build models from data. Given a learning algorithm, a model can be generated from data and then used to make judgments about new situations. Machine learning is developing rapidly and is a hot topic in computer applications. Many appearance-based static gesture recognition approaches use machine learning; commonly used algorithms include support vector machines, artificial neural networks, and the AdaBoost method.


The support vector machine (SVM) is a binary classification model whose basic form is a linear classifier with the maximum margin in the feature space; the kernel method extends it to a nonlinear classifier. Its learning strategy is to maximize the margin, which can be formalized as a convex quadratic programming problem with a global optimal solution.

The artificial neural network, which originated in the early 1940s, is a network of simple adaptive units that are widely interconnected in parallel. It can simulate the interactive responses of a biological nervous system to the real world and offers strong fault tolerance and robustness, high parallelism, adaptability, interference resistance, and learning ability. With the rise of deep learning, neural networks have attracted renewed attention and are widely used in speech recognition and image classification. There are many kinds of neural networks, and their gesture recognition rate is generally limited by the quality of the hand detection model and the number of training samples.

Boosting is a statistical learning method that improves a weak learning algorithm into a strong one: by repeatedly adjusting the weight distribution over the training data, it constructs a series of basic (weak) classifiers and linearly combines them into a strong classifier. Classical Boosting requires the upper bound on the error of the weak classifiers to be known in advance, which is difficult in practice. AdaBoost was obtained by combining weighted voting with online allocation within the Boosting framework; it is the best-known representative of the Boosting family and is widely used in human body detection and recognition. AdaBoost has several advantages: it provides a framework in which sub-classifiers can be built by various methods, simple weak classifiers can be used without feature filtering, and overfitting rarely occurs; it requires neither prior knowledge of the weak classifiers nor an advance bound on their error; the final accuracy of the strong classifier depends on the accuracy of all the weak classifiers, whose ability it can fully exploit; and it adaptively adjusts the assumed error rate based on feedback from the weak classifiers, executing efficiently and significantly improving learning accuracy. However, during training AdaBoost causes the weights of hard-to-classify samples to grow exponentially, so training becomes biased towards these difficult samples and the accuracy of the resulting classifier suffers. In addition, AdaBoost is easily disturbed by noise, its performance depends on the choice of weak classifiers, and training the weak classifiers can take a long time.
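
A minimal sketch of training an SVM gesture classifier with scikit-learn is given below; X and y are hypothetical arrays of pre-extracted feature vectors (e.g. HOG features of segmented hands) and their gesture labels, and the kernel and hyperparameters are illustrative assumptions.

```python
# Sketch: training and evaluating an SVM gesture classifier on feature vectors.
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def train_gesture_svm(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # The RBF kernel extends the linear maximum-margin classifier to nonlinear boundaries.
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))
    return clf
```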


2.2.3 Hidden Markov model method


The hidden Markov model (HMM) is a probabilistic model of time sequences. It describes a process in which a hidden Markov chain randomly generates an unobservable sequence of states, and each state in turn generates an observation, producing a random observation sequence. Hidden Markov models are well suited to describing sequential models, especially context-dependent ones. The HMM is an extension of the Markov chain, a dynamic Bayesian network with a simple structure, and a well-known directed graphical model. As a typical method based on probability and statistics, it is widely used in speech recognition, gesture recognition, and other fields. For gesture recognition, the HMM is suitable for continuous gesture recognition, especially for complex gestures that involve context. Training and recognition with HMMs require a large amount of computation, particularly when analyzing continuous signals: the state transitions require evaluating many probability densities and the number of parameters grows, which slows down both sample training and target recognition. To alleviate this problem, discrete hidden Markov models are used in gesture recognition systems.
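
The sketch below illustrates the scaled forward algorithm for a discrete HMM together with a simple per-class scoring scheme: one HMM per gesture, the highest likelihood wins. The observation sequence is assumed to be quantized feature symbols, and the per-gesture parameters (pi, A, B) are hypothetical and assumed already trained (e.g. by Baum-Welch).

```python
# Sketch: scoring a quantised observation sequence with a discrete HMM (forward algorithm).
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """obs: observation symbol indices; pi: initial state probabilities (N,);
    A: state transition matrix (N, N); B: emission matrix (N, M)."""
    alpha = pi * B[:, obs[0]]                 # initialise with the first observation
    log_lik = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # propagate through transitions, then emit
        scale = alpha.sum()                   # rescale to avoid numerical underflow
        log_lik += np.log(scale)
        alpha /= scale
    return log_lik + np.log(alpha.sum())      # total log P(obs | model)

def recognise(obs, models):
    # models: hypothetical dict {gesture_label: (pi, A, B)}; pick the best-scoring gesture
    return max(models, key=lambda g: forward_log_likelihood(obs, *models[g]))
```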


3. Conclusion


Vision-based gesture interaction is an important way for people to interact with a virtual environment. Its naturalness and convenience are of great significance for the immersive experience of virtual reality. Although some staged research results have been achieved, many problems remain to be solved, such as hand detection against complex backgrounds, fusion with other interaction modalities, and functional integration. Vision-based gesture interaction has important scientific value and broad application prospects; as the demand for immersive experience in virtual reality grows, it will certainly play an important role in virtual reality.