A Brief Introduction to Human-Computer Interaction in Virtual Reality - Dr. Zhu Zhang


Abstract: Human-computer interaction is one of the core technologies of virtual reality. Thanks to advances in graphics and display technology, virtual reality has made great progress. This paper reviews and analyzes the design goals of human-computer interaction in virtual reality and the main research achievements in interaction technologies such as three-dimensional interaction, gesture interaction, voice interaction, force/touch interaction and multi-channel interaction. Finally, some open problems are discussed.

Keywords: virtual reality; human-computer interaction; three-dimensional interaction

1. Design objectives of human-computer interaction in virtual reality

Virtual reality is a computer system for creating and experiencing virtual worlds. The virtual world is generated by computer and acts on users through sight, hearing, touch, smell and other senses, giving them the feeling of being personally present in the environment [1]. Immersion, interaction and imagination are the three basic characteristics of a virtual reality system.

Virtual reality technology and human-computer interaction technology complement each other when combined [2]. The goal of human-computer interaction design is to transform the user's behavior and state (input), through appropriate metaphors, into a representation that the computer can understand and operate on, and to transform the computer's behavior and state (output) into a representation that the human can understand and act on, fed back to the human through the interface. On the one hand, a virtual reality system needs to perceive input from channels such as the user's muscle movement, posture, speech and body tracking; on the other hand, it must reproduce the feel of the real world [3] through multiple sensory channels such as vision, hearing, touch and smell, so as to establish a natural and harmonious human-machine environment.

2. Human-computer interaction technology in virtual reality

Human-computer interaction technology refers to the specific methods by which a given interaction task is completed through devices and interfaces. The technologies involved include not only three-dimensional interaction and gesture interaction, but also voice interaction, force/touch interaction and multi-channel interaction.

2.1 Three-dimensional interaction


Compared with 2D interaction, 3D interaction in virtual reality offers more degrees of freedom, and the interaction tasks are more complex. The interface design space is therefore larger, and new interaction metaphors and techniques are needed. An interaction metaphor borrows a mechanism from the real world, by analogy or abstraction, and applies it to the interaction process. In 3D interaction, a metaphor maps the input device's spatial position/orientation information or discrete button states to operations in virtual space in order to complete a specific interaction task. Different metaphors provide different interaction methods. According to how the interaction system maps its input, existing 3D interaction techniques can be divided into two categories: direct mapping and indirect mapping.

Direct mapping means that the position/orientation information supplied by the device is mapped directly onto the operation of a hand or device in virtual space. The main metaphors include the ray-casting metaphor and the virtual hand metaphor.

Ray-casting metaphor. Ray-casting casts a virtual ray onto the virtual object to be grabbed, with the direction of the ray controlled by the user's hand. To make small, distant objects easier to select, some researchers have proposed replacing the cylindrical beam with a cone-shaped beam.
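
As a rough illustration of cone-based selection, the sketch below picks the object whose centre lies closest to the pointing axis while still falling inside the cone. The function name `cone_select`, the 5-degree half-angle and the use of object centres instead of full geometry are assumptions made for this example, not details taken from the cited work.

```python
import math

def cone_select(origin, direction, objects, half_angle_deg=5.0):
    """Return the name of the object closest to the cone axis, or None.

    origin, direction: 3-tuples for the hand position and pointing direction.
    objects: dict mapping a (hypothetical) object name to its 3-D centre.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    axis = tuple(c / norm for c in direction)
    best_name, best_angle = None, math.radians(half_angle_deg)
    for name, centre in objects.items():
        to_obj = tuple(c - o for c, o in zip(centre, origin))
        dist = math.sqrt(sum(c * c for c in to_obj))
        if dist == 0.0:
            continue  # object coincides with the hand; skip it
        cos_angle = sum(a * t for a, t in zip(axis, to_obj)) / dist
        angle = math.acos(max(-1.0, min(1.0, cos_angle)))
        if angle <= best_angle:  # inside the cone and closer to the axis
            best_name, best_angle = name, angle
    return best_name
```

Because the test is angular, the tolerated offset from the axis grows with distance, which is why a cone helps with small, far-away targets where a thin ray would miss.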

Virtual hand metaphor. Virtual hand technology constructs a virtual avatar of the user's hand in the virtual environment; the position and orientation of the virtual hand are controlled by the user's hand (through a tracker). However, because the working range of the human arm is limited, distant objects are difficult to manipulate. The Go-Go technique, which is based on a nonlinear mapping, can therefore be used to extend the reach of the hand in virtual space.
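
The nonlinear mapping usually associated with the Go-Go technique keeps the virtual hand at the real hand's distance while the arm is close to the body, and adds a quadratic term once the hand passes a threshold distance. The sketch below assumes distances in metres; the threshold of 0.4 and the coefficient of 6.0 are illustrative values chosen for this example only.

```python
def gogo_reach(real_dist, threshold=0.4, k=6.0):
    """Map the real hand distance from the body to the virtual hand distance.

    Within `threshold` the mapping is one-to-one; beyond it the virtual
    hand moves away quadratically, extending the user's reach.
    `threshold` and `k` are illustrative assumptions.
    """
    if real_dist < threshold:
        return real_dist
    return real_dist + k * (real_dist - threshold) ** 2
```

With these example values, a hand held 0.3 m from the body stays at 0.3 m, while a hand stretched to 0.6 m places the virtual hand at about 0.84 m.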

Indirect mapping means that the device input is mapped to gestures, the gestures control the scale of the scene space, and the interaction task is then completed in the rescaled space. This mapping mode usually relies on two-handed interaction and exploits the asymmetric division of labor between the hands: the non-dominant hand serves as the reference coordinate frame, and the dominant hand performs fine operations relative to it. The main metaphors include Worlds in Miniature (WIM) and Scaled-World Grab.

Worlds in Miniature (WIM). The WIM metaphor is operated with two hands, which hold a tracked panel prop and a tracked button prop respectively. The panel prop represents a miniature copy of the entire virtual scene in virtual space. Objects in the miniature copy are linked to the corresponding objects in the scene, so scene objects can be manipulated within the miniature.
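
The link between the miniature and the full-scale scene can be thought of as a coordinate transform. The sketch below assumes a uniform scale and ignores the rotation of the hand-held panel to stay short; `wim_origin`, `scale` and both function names are hypothetical.

```python
def wim_to_world(p_mini, wim_origin, scale, world_origin=(0.0, 0.0, 0.0)):
    """Convert a point in the hand-held miniature to full-scale scene coordinates."""
    return tuple((p - o) / scale + w
                 for p, o, w in zip(p_mini, wim_origin, world_origin))

def world_to_wim(p_world, wim_origin, scale, world_origin=(0.0, 0.0, 0.0)):
    """Convert a full-scale scene point to its position inside the miniature."""
    return tuple((p - w) * scale + o
                 for p, o, w in zip(p_world, wim_origin, world_origin))
```

Moving a miniature object by 5 cm at a scale of 1:100 would then move the corresponding scene object by 5 m.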

Scaled-World Grab. This interaction technique uses proprioception to compensate for the lack of haptic feedback in immersive virtual environments. When the user grabs a virtual object, the scene is automatically scaled down about the user's head; when the user releases the object, the original scale is automatically restored.
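
One way to realise this scaling, assumed here purely for illustration, is to choose the scale factor so that the grabbed object ends up at the hand's distance from the head, and to scale every scene point about the head position. The function names are hypothetical.

```python
import math

def grab_scale(head, hand, obj):
    """Scale factor that brings the object to the hand's distance from the head."""
    return math.dist(head, hand) / math.dist(head, obj)

def scale_about_head(point, head, s):
    """Scale a scene point about the head position by factor s."""
    return tuple(h + s * (p - h) for p, h in zip(point, head))
```

Since the scaling is centred on the head, the user's view of the scene is unchanged at the moment of the grab; on release, applying the inverse factor `1 / s` restores the original scale.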

2.2 Gesture Interaction

In virtual reality, the movement of the user's body or limbs can serve as an important input channel. Relevant parts of the body (such as the head, hands, arms or legs) are tracked with trackers or computer vision methods, and the resulting motion and posture information from the physical world is fed into the virtual reality system, where a recognition algorithm interprets it as a gesture or a posture (collectively referred to as gestures below). This is one of the most important input methods in virtual reality.

According to the input device used, the main gesture and posture input technologies can be divided into glove-based gesture recognition, video-based recognition, and gesture recognition based on inertial sensor units.

In glove-based gesture recognition, the raw data obtained from the glove device can be analyzed with recognition algorithms such as hidden Markov models and neural networks, and glove gesture recognition methods based on a sliding window can achieve high accuracy. The hand can then act as a button, valuator, locator or pick device. Pinch gloves support a limited set of hand shapes, while data gloves provide both hand shapes and gestures by measuring joint angles.
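
To make the sliding-window idea concrete, the sketch below labels each glove frame by its nearest joint-angle template and then smooths the labels with a majority vote over the most recent frames. This is a deliberately simple stand-in for the hidden Markov model or neural network classifiers mentioned above; the templates, the five-angle frame layout and the window size are all hypothetical.

```python
import math
from collections import Counter, deque

# Hypothetical templates: flexion angles in degrees, thumb to little finger.
TEMPLATES = {
    "fist": (80, 90, 90, 90, 90),
    "open": (5, 5, 5, 5, 5),
    "point": (80, 5, 90, 90, 90),
}

def classify_frame(angles):
    """Label one frame with the nearest template (Euclidean distance)."""
    return min(TEMPLATES, key=lambda name: math.dist(TEMPLATES[name], angles))

def sliding_window_labels(frames, window=5):
    """Smooth per-frame labels with a majority vote over the last `window` frames."""
    recent = deque(maxlen=window)
    smoothed = []
    for angles in frames:
        recent.append(classify_frame(angles))
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed
```

The vote suppresses isolated misclassifications, for example a single noisy frame in the middle of a steady open-hand posture.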

In video-based recognition, computer vision algorithms identify specific postures from video images of the hands, fingers, limbs or head. Microsoft's release of the Kinect [4] depth camera revolutionized video-based gesture interaction. Kinect combines an ordinary RGB camera with an infrared camera for depth sensing to obtain RGB and depth images of the human body. Based on the depth image data, the accuracy and speed of tracking the skeleton, hands and head improved greatly, which laid the foundation for practical gesture interaction.
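
A depth image is useful for tracking because each pixel can be turned into a 3-D point. The sketch below uses the standard pinhole camera model for this back-projection; the intrinsic parameters passed in (`fx`, `fy`, `cx`, `cy`) are placeholders supplied by the caller and are not the calibration of any particular Kinect device.

```python
def depth_pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in metres to camera-space (x, y, z).

    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)
```

A pixel at the principal point maps onto the optical axis, and pixels further from it map to points proportionally further off-axis at the same depth.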

In gesture recognition based on inertial sensor units, an inertial measurement unit (IMU) worn on or held in the hand provides information such as the angle, acceleration and orientation of the hand or joint motion, and the user's gesture is then obtained through pattern recognition. Typical examples are Nintendo's Wii Remote, Sony's PS4 gamepad, and most smartphones.
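
As a minimal example of turning IMU readings into a gesture, the sketch below detects a "shake" by counting how often the acceleration magnitude rises above a threshold. A fixed threshold is a much simpler rule than the pattern recognition methods used in practice; the threshold, the peak count and the function name are assumptions for this example.

```python
import math

def detect_shake(samples, threshold=15.0, min_peaks=3):
    """Return True if the acceleration magnitude crosses `threshold` upward
    at least `min_peaks` times.

    samples: iterable of (ax, ay, az) accelerometer readings in m/s^2.
    """
    peaks = 0
    above = False
    for ax, ay, az in samples:
        magnitude = math.sqrt(ax * ax + ay * ay + az * az)
        if magnitude > threshold and not above:
            peaks += 1  # count each upward crossing once
        above = magnitude > threshold
    return peaks >= min_peaks
```

A device at rest reports roughly 9.81 m/s^2 from gravity alone, so the example threshold is set well above that value.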

2.3 Voice Interaction

Voice input is a natural input method that can be combined with other types of input technology (i.e., multi-channel interaction) to create a more coherent and natural interface. Used properly, voice input is a valuable tool in virtual reality user interfaces, especially when both hands are occupied. Voice has many desirable features: it frees the user's hands; it uses an otherwise unused input channel; it allows efficient and accurate input of large amounts of text; and it is completely natural and familiar. In virtual reality user interfaces, voice input is especially suitable for non-graphical command interaction and system control, where users issue voice commands asking the system to perform a specific function or to change the interaction mode or system state.
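
Voice-driven system control can be sketched as a lookup from recognised words to actions. The example below assumes that a speech recognition engine has already produced a text transcript; the command vocabulary and action names are hypothetical.

```python
# Hypothetical command vocabulary for switching modes or opening a menu.
COMMANDS = {
    "teleport": "MODE_TELEPORT",
    "grab": "MODE_GRAB",
    "menu": "OPEN_MENU",
}

def interpret_utterance(transcript):
    """Return the first system-control action named in the transcript, or None."""
    for word in transcript.lower().split():
        action = COMMANDS.get(word.strip(".,!?"))
        if action:
            return action
    return None
```

For instance, `interpret_utterance("Open the menu, please")` yields `"OPEN_MENU"`, while an utterance without any known command word yields `None` and can be ignored by the system.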

The key to voice interaction is the speech recognition engine. Existing speech recognition software includes the Microsoft Speech API, IBM ViaVoice, Nuance and China's iFlytek, all of which achieve very good performance. As speech recognition technology has gradually been opened up and open-sourced, the barrier to entry for speech technology has fallen. Major open-source speech recognition toolkits include CMU Sphinx, HTK (Cambridge), Julius and RWTH ASR. In recent years, the rise of Google Glass, wearable devices, smart homes and in-car devices has pushed speech recognition to the forefront of applications. Since Siri was added to Apple's iPhone 4S, almost all phones have shipped with built-in voice assistants. Representative voice assistant products currently include the smart speakers Google Home and Amazon Echo abroad, and Xiaomi's smart speaker Xiao Ai (Xiao Ai Tongxue) in China.

2.4 Force/touch interaction

Compared with traditional visual and auditory interaction, tactile interaction gives users a more realistic sense of immersion and plays an irreplaceable role in the interaction process. In traditional human-machine interfaces, force/touch interaction is regarded as a special input/output mode: as input devices, force/touch devices capture user movements, and as output devices, they provide users with tactile sensations. Force/touch interaction in virtual reality focuses on natural interaction, which is an important direction for the future development of human-computer interaction.

Representative research on force/touch interaction includes Microsoft's 3D haptic feedback touchscreen, which consists of an LCD flat panel, multiple force sensors and a robot arm that moves the screen back and forth. When the user touches the screen, the force sensors capture the force the user applies. Combining this with other parameters, the robot arm moves the screen smoothly to generate force feedback and convey the shape of objects.

2.5 Multi-channel interaction in virtual reality

Multi-channel interaction is a cooperative mode in which two or more input channels (such as voice, video, touch and gesture) are combined in one system. Because it makes full use of the different human sensory channels, the interaction is more natural and effective. In a multi-channel (multimodal) user interface, users can work with the computer system through natural means such as voice, gestures, eye gaze, facial expressions and lip motion, and the machine is regarded as an active participant. Information from the input channels can be combined in a variety of ways, such as serially or in parallel, and complementarily or independently, which brings human-computer interaction closer to the way humans interact with one another. This greatly improves the naturalness and efficiency of interaction, and multi-channel interaction is expected to become the mainstream form of human-computer interaction in virtual reality.

Using multi-channel interaction in virtual reality has the following advantages:

The first is decoupling. Using an input channel different from the main channel used to interact with the virtual environment can reduce the user's cognitive load: if users do not have to switch between manipulation and system control, they can stay focused on their main activity.

The second is error reduction and correction. Using multiple input channels is very effective when the input is ambiguous or noisy, especially with recognition-based input (such as voice or gesture). Combining the input from multiple channels can significantly improve the recognition rate.

The third is flexibility and complementary behavior. When users can choose among multiple channels to complete the same task, input becomes more flexible.

The fourth is control of mental resources. Multi-channel interaction can reduce cognitive load, but accessing several mental resources at the same time may also lower interaction efficiency.
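
The error-reduction advantage listed above can be illustrated with a small sketch in which each channel's recogniser outputs a confidence score per candidate command and the scores are multiplied and renormalised. Treating the channels as independent and combining them by a simple product is an assumption of this example, and the command names are hypothetical.

```python
def fuse_scores(*channel_scores):
    """Combine per-channel confidence scores by product and renormalise.

    Each argument is a dict mapping a candidate command to a score in [0, 1].
    Only candidates reported by every channel are kept.
    """
    if not channel_scores:
        return {}
    candidates = set.intersection(*(set(s) for s in channel_scores))
    combined = {}
    for candidate in candidates:
        score = 1.0
        for scores in channel_scores:
            score *= scores[candidate]
        combined[candidate] = score
    total = sum(combined.values())
    if total == 0.0:
        return {}
    return {c: s / total for c, s in combined.items()}
```

If speech alone slightly prefers "select" (0.55 against 0.45 for "delete") but the gesture channel clearly indicates "delete" (0.7 against 0.3), the fused result favours "delete", which neither ambiguous channel could settle on its own.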

2.6 Multi-channel interaction information fusion

Multi-channel systems allow users to interact through several channels at once, usually voice, gesture or tactile input. Facial expression recognition and lip reading are also used as input channels. Multi-channel interfaces can combine the benefits of the individual channels or switch between channels according to the context. Because it integrates the input streams of multiple channels, multi-channel interaction can greatly improve system control performance in virtual reality.

There are two main approaches to multi-channel fusion: early fusion and late fusion. Early fusion, also known as feature fusion, fuses the raw input data at the signal level and is suitable when the fused channels are tightly coupled, such as speech plus lip reading. Late fusion, also called semantic fusion, maps the input data to a semantic interpretation: the input streams are first obtained from the individual channels and converted by preliminary preprocessing into a unified data representation, which is then interpreted and merged at the semantic level. Late fusion is usually adopted for fusing speech with gestures.
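
The difference between the two strategies can be sketched as follows. Early fusion is shown as concatenating time-aligned feature vectors before any recognition takes place; late fusion is shown as filling the deictic slots of an already recognised speech command ("move that there") with the results of pointing gestures. The data layouts, slot names and function names are assumptions made for this illustration.

```python
def early_fusion(audio_features, lip_features):
    """Feature-level fusion: concatenate time-aligned per-frame feature vectors."""
    if len(audio_features) != len(lip_features):
        raise ValueError("streams must be time-aligned")
    return [list(a) + list(l) for a, l in zip(audio_features, lip_features)]

def late_fusion(speech_frame, pointing_events):
    """Semantic-level fusion: resolve 'that'/'there' using pointing results in order."""
    fused = dict(speech_frame)
    pointers = iter(pointing_events)
    for slot, value in speech_frame.items():
        if value in ("that", "there"):
            fused[slot] = next(pointers, None)  # None if no gesture is left
    return fused
```

In the early case a single recogniser would consume the joint feature vectors, whereas in the late case each channel is recognised separately and only the interpreted results are merged.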

3. Problems and Prospects

Based on the above description and analysis of existing virtual reality interaction technologies, human-computer interaction in virtual reality environments still faces the following challenges:

The first is perceiving users' natural interaction behavior and state in the virtual environment. Although different types of sensors can collect information such as a person's appearance, EEG, physiological and biochemical parameters, and voice, this information captures only natural behavior or outward manifestations; it does not accurately portray the user's real thoughts and intentions.

The second is the adaptation of interactive output feedback in virtual reality. The feedback generated by the virtual environment must be perceived by people, whose perceptual channels are multimodal. How to generate feedback that accords with human cognitive processing in a multimodal, time-varying output space is the core problem of interactive feedback in virtual reality.

Therefore, it is expected that virtual reality will make breakthrough progress in the following aspects in the future:

First, develop new sensing devices and provide basic theoretical support for research on human-computer interaction technology in virtual reality.

Second, explore multi-channel interaction mechanisms, study multi-channel feedback presentation and adaptive adjustment methods, and build a harmonious and natural human-machine environment for virtual reality.


[1] Wang Cheng. Theory, Realization and Application of Lingjing (Virtual Reality) Technology [M]. Tsinghua University Press, 1996.

[2] Li Chenlong. Research on Human-Computer Interaction Technology in Virtual Reality System [D]. Zhejiang University, 2017.

[3] Zhang Fengjun, Dai Guozhong, Peng Xiaolan. Science China: Information Science, 2016, 46(12): 1711-1736.

[4] Zhang Z. Microsoft Kinect sensor and its effect. In: Proceedings of IEEE. Piscataway: IEEE, 2012: 4-10.