Immersive monitoring: From perception to practice

Hearing the world around us is so natural that we often only notice its importance once the ability is lost. Fortunately, most losses are temporary, caused for instance by a cold, but even a one-sided hearing loss is more stressful and depressing than we generally tend to believe.

Localisation makes use of some of the most energy-consuming and fastest-firing synapses in the brain, a sign of how important the capability has been for our survival. One of the first things a baby does is localise, quickly and automatically turning its eyes towards a sound. Until adolescence we continue to learn and refine localisation using a system still under construction: the ear canals and the other structures of the outer ear (the pinnae) grow and reshape, constantly modifying spherical hearing, as we reach out and experience a fascinating world in return. Pinnae remain entirely personal.

They actually keep developing throughout life, though the rate of change slows in adults. Sound is coloured by the pinnae depending on its direction of arrival (azimuth and elevation), and that coloration is a highly important cue. Expert listeners use it constantly, in combination with head movements, not only when evaluating immersive content but also to distinguish direct sound from room reflections.

Personal head-related transfer functions (HRTFs) enable localisation at frequencies above roughly 700 Hz, the range where interaural level difference (ILD) is the primary cue. From 50 Hz to 700 Hz, however, fast-firing synapses in the brainstem take over, phase-locking to the waveform to determine interaural time difference (ITD).
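The magnitude of the ITD cue is easy to put in numbers. Below is a minimal Python sketch of the classic Woodworth spherical-head approximation; the 8.75 cm head radius is an assumed average, not a personal measurement.

    import numpy as np

    def woodworth_itd(azimuth_deg, head_radius_m=0.0875, c=343.0):
        """Approximate interaural time difference (ITD) for a spherical head.

        Uses the Woodworth formula ITD = r/c * (theta + sin(theta)),
        valid for a source in the horizontal plane. The head radius is
        an assumed average, not a measured value.
        """
        theta = np.radians(azimuth_deg)
        return head_radius_m / c * (theta + np.sin(theta))

    # A source 90 degrees to the side yields roughly 0.65 ms, near the
    # maximum ITD the brainstem evaluates below ~700 Hz.
    print(f"ITD at 90 deg: {woodworth_itd(90.0) * 1e3:.2f} ms")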

The ability to position sound sources precisely anywhere on the sphere is a key benefit of immersive systems. Another is the ability to influence the listener's sense of space. For the latter, the lowest two octaves of the ITD range (i.e. 50-200 Hz) play an essential role, but they may be compromised in multiple ways: insufficient physical spacing between microphones during pick-up, synthesised reverb without the right kind of decorrelation, lossy codecs that collapse inter-channel differences, loudspeakers with limited low-frequency capability, bass management, etc. Such damage is easy to spot with a correlation check like the sketch below. So what does perception ask of immersive reference monitoring, besides enough discrete channels and headroom?
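As an illustration, inter-channel correlation in that critical band can be measured with a few lines of Python (a sketch assuming SciPy is available; the band edges follow the 50-200 Hz range discussed above):

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def lf_correlation(left, right, fs, band=(50.0, 200.0)):
        """Inter-channel correlation coefficient in a low-frequency band.

        Values near 1.0 indicate that the channel differences conveying
        spaciousness have collapsed (e.g. through a lossy codec or bass
        management); values near 0 indicate full decorrelation.
        """
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
        l = sosfiltfilt(sos, left)
        r = sosfiltfilt(sos, right)
        return np.dot(l, r) / np.sqrt(np.dot(l, l) * np.dot(r, r))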

Multichannel monitoring standards such as ITU-R BS.775 specify using the same monitor model for all channels, but more importantly, each monitor must be adjusted for its placement, or frequency responses become extremely variable. Unpredictable responses disturb localisation and defeat head movements; fig 1 shows how wildly identical speakers may behave in an actual room, with the red curves showing the resulting frequency responses when each monitor is left at its factory default.
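The placement dependence is not mysterious: even a single reflecting wall behind a monitor changes the low-frequency response substantially. A minimal sketch, modelling the wall as a mirror-image source with an assumed reflection gain:

    import numpy as np

    def boundary_response_db(freqs, distance_to_wall_m,
                             reflection_gain=0.8, c=343.0):
        """Magnitude change caused by one reflecting wall behind a monitor.

        The wall is modelled as a mirror-image source; the extra path
        length is twice the distance to the wall, and the reflection
        gain of 0.8 is an illustrative assumption. The resulting
        peak/dip pattern depends entirely on placement.
        """
        delay = 2.0 * distance_to_wall_m / c            # extra travel time
        total = 1.0 + reflection_gain * np.exp(-2j * np.pi * freqs * delay)
        return 20.0 * np.log10(np.abs(total))

    freqs = np.array([40.0, 80.0, 160.0])
    print(boundary_response_db(freqs, 0.5))   # monitor 0.5 m from the wall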

A deviation of 20 dB is clearly not compatible with reference listening conditions, so in-situ compensation per monitor is the first requirement, regardless of brand. The second is well-controlled directivity, to avoid off-axis coloration of the direct sound and to prevent coloration of reflections in general. Point-source monitors improve both properties further and may be considered, if not in general, then at least for the primary channels.
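In its simplest form, in-situ compensation amounts to inverting a smoothed in-room measurement within sensible limits. A sketch, with boost/cut limits chosen for illustration rather than taken from any standard:

    import numpy as np

    def in_situ_correction_db(measured_db, target_db=0.0,
                              max_boost_db=6.0, max_cut_db=12.0):
        """Per-frequency correction gains from a smoothed in-room measurement.

        measured_db: magnitude response of one monitor at the listening
        position, in dB relative to the desired target. Limiting boost
        avoids driving woofers hard into room-mode nulls; the 6/12 dB
        limits are illustrative assumptions.
        """
        error = target_db - np.asarray(measured_db, dtype=float)
        return np.clip(error, -max_cut_db, max_boost_db)

    # A monitor measuring +7 dB at 50 Hz (boundary gain) and -10 dB at
    # 120 Hz (room-mode null) gets a full -7 dB cut but only +6 dB boost:
    print(in_situ_correction_db([7.0, -10.0]))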

The third requirement is consistent listening levels: because equal-loudness contours change shape with level, the subjective differences in frequency response would otherwise be of the same magnitude as the objective differences created by not correcting monitors for placement, detailed above. Calibrated listening level is therefore a requirement in film, drama and gaming standards, and it also helps ensure speech intelligibility.
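Once a reference is chosen, calibration itself reduces to a simple offset. The sketch below assumes the common film practice of 85 dB SPL per channel for -20 dBFS RMS pink noise; other genres and room sizes use different references, so treat the numbers as an example:

    def calibration_offset_db(measured_spl_db, reference_spl_db=85.0):
        """Gain change needed so the calibration signal hits reference level.

        reference_spl_db follows common film practice (85 dB SPL per
        channel for -20 dBFS RMS pink noise); adjust for other genres
        and room sizes.
        """
        return reference_spl_db - measured_spl_db

    # Monitor measured at 82.5 dB SPL with the calibration noise:
    print(f"raise monitor gain by {calibration_offset_db(82.5):+.1f} dB")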

The fourth requirement is a frequency response extending down to at least 50 Hz in all channels, in order to manipulate the sense of space effectively. Correlated reverb with high-Q ringing at low frequencies conveys a small room, while an enormous space can be created with reverb fully decorrelated at low frequencies in as many channels as possible. All in all, a well-aligned loudspeaker system in a fine room has the best chance of translating well to a variety of immersive playback situations.
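One simple way to obtain such decorrelation is random-phase all-pass filtering of a shared reverb signal, sketched below: magnitudes stay identical while inter-channel correlation drops toward zero, including at low frequencies. Production decorrelators shape the phase far more carefully; this is only a minimal illustration.

    import numpy as np

    def decorrelate_channels(mono_reverb, n_channels, seed=0):
        """Spread one reverb signal to n channels via random-phase all-pass.

        Each channel gets an FFT-domain all-pass with independent random
        phase (|H| = 1), so per-channel magnitude spectra are unchanged
        while inter-channel correlation falls toward zero.
        """
        rng = np.random.default_rng(seed)
        n = len(mono_reverb)
        spectrum = np.fft.rfft(mono_reverb)
        outs = []
        for _ in range(n_channels):
            phase = rng.uniform(-np.pi, np.pi, len(spectrum))
            phase[0] = 0.0          # keep DC real
            phase[-1] = 0.0         # keep Nyquist bin real
            outs.append(np.fft.irfft(spectrum * np.exp(1j * phase), n=n))
        return np.stack(outs)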

The sound engineer can then make full use of outer-ear features and head movements, with listener fatigue and “cyber sickness” minimised. If headphone-based monitoring is used for immersive production, it should incorporate precise, personal HRTFs and head tracking around an n-channel virtual reference room. Even so, any static or temporal imperfection can lead to listener fatigue, and head movements in production are unlikely to produce anywhere near the same results as during reproduction across platforms.
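For orientation, the core of such a renderer can be sketched as follows. The hrtf_db mapping and the single yaw angle are hypothetical simplifications; a real system interpolates HRTFs, handles elevation, and updates continuously as the head moves.

    import numpy as np
    from scipy.signal import fftconvolve

    def render_binaural(channels, speaker_az_deg, head_yaw_deg, hrtf_db):
        """Static binaural render of virtual speakers, compensating head yaw.

        channels: (n_speakers, samples); hrtf_db: hypothetical dict
        mapping an azimuth in degrees to an (ir_left, ir_right) pair
        from a personal HRTF set, all impulse responses equal length.
        Nearest-neighbour HRTF selection, no wraparound handling.
        """
        ir_len = next(iter(hrtf_db.values()))[0].size
        out = np.zeros((2, channels.shape[1] + ir_len - 1))
        for sig, az in zip(channels, speaker_az_deg):
            rel = round((az - head_yaw_deg) % 360)   # azimuth relative to head
            ir_l, ir_r = hrtf_db[min(hrtf_db, key=lambda k: abs(k - rel))]
            out[0] += fftconvolve(sig, ir_l)
            out[1] += fftconvolve(sig, ir_r)
        return out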