This paper proposes a novel feature type for the recognition of emotion from speech. We explored different Gabor feature implementations, along with different speaker recognition approaches, on the ROSSI [1] and NIST SRE08 databases. Gabor features have been used mainly for automatic speech recognition (ASR), where they have yielded improvements.
Speaker recognition is unobtrusive: speaking is a natural process, so no unusual actions are required. A comprehensive treatment of such spectro-temporal integration of speech as it relates to aging is lacking. In this paper, we present a new technique to extract a noise-robust representation of speech signals called the spectro-temporal power spectrum. Modelling, feature extraction and effects of clinical environment: a thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy, Sheeraz Memon. Spectro-temporal refers most commonly to audition, where the neuron's response depends on frequency versus time, while spatio-temporal refers to vision, where the neuron's response depends on spatial location versus time. Our GUI has basic functionality for recording, enrollment, training and testing, plus a visualization of real-time speaker recognition. Gabor filterbank features for robust speech recognition.
Array-based spectro-temporal masking for automatic speech recognition, submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering, Amir R. The STRF is derived from physiological models of the mammalian auditory system in the spectral-temporal domain. Prosodic features: prosody refers to non-segmental aspects of ... An overview of text-independent speaker recognition. Index terms: speaker recognition, Gaussian mixture model, feature extraction, expectation maximization, TIMIT database. Multi-stream spectro-temporal features in a tandem recognition system: Figure 3 illustrates the usage of spectro-temporal features in an MLP/conventional-speech-recognizer tandem system [10] to carry out the recognition experiments. Feature extraction techniques in speaker recognition. The microphone signal was fed to a custom-made, real-time song recognizer that detected the first stereotypic syllable of song motifs using a two-layer neural network trained on spectro-temporal song data. In this work we built an LSTM-based speaker recognition system on a dataset collected from Coursera lectures. As suggested by ..., the STRF can be effectively modelled by two ...
This paper proposes a speaker recognition system using acoustic features that are based on spectral-temporal receptive fields (STRFs). As a starting point, the properties of LSTF features [1] are evaluated. These features are then combined to obtain joint spectro-temporal features, which are used for a posterior-based speech recognition system. As speech spectral peaks constitute the regions of high-SNR (signal-to-noise ratio) values in the speech spectrogram, we expect ... The resulting features showed a close resemblance to the STRFs of cortical neurons in the auditory system. For successful speaker recognition, an understanding of the principles of human speaker recognition is essential; speaker recognition research should therefore include a close study of the cues that humans use to recognize a speaker. Finding the stable features of voice is therefore the most important task for speaker recognition.
Spectro-temporal modulation transfer functions (MTFs) are derived as a function of ripple peak density (Ω, in cycles/octave) and drifting velocity (ω, in Hz). Spectro-temporal analysis of speech using 2D Gabor filters. Other biologically inspired spectro-temporal speech features, e.g. ... Speaker recognition for forensic applications. This work was sponsored under Air Force contract FA8721-05-C-0002. High-level features: these features attempt to capture ... The joint spectro-temporal features adaptively capture ... Input audio of the unknown speaker is paired against a group of selected speakers, and if a match is found, the speaker's identity is returned. In this work, we have investigated the performance of 2D Gabor features (known as spectro-temporal features) for speaker recognition.
Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government. Use advanced AI algorithms for speaker verification and speaker identification. They are compared to short-term spectral features as well as popular prosodic features. Note that real-time speaker recognition is extremely hard, because we only use a corpus of about 1 second in length to identify the speaker. Spectro-temporal Gabor features for speaker recognition. Howard Lei, Bernd T. Meyer, and Nikki Mirghafori. The spectro-temporal receptive field or spatio-temporal receptive field (STRF) of a neuron represents which types of stimuli excite or inhibit that neuron. Spectro-temporal Gabor features as a front end for automatic speech recognition. PACS reference: 43. Spectro-temporal Gabor features for speaker recognition: conference paper, Acoustics, Speech, and Signal Processing, 1988. The term voice recognition can refer to speaker recognition or speech recognition.
ASR phoneme recognition rates for different speaking rates, efforts and styles for MFCC and spectro-temporal Gabor features, obtained with the Oldenburg Logatome Corpus. Spectro-temporal modulation spectrogram: neurophysiological studies suggest that the responses of neurons in the primary auditory cortex of mammals are tuned to specific spectro-temporal patterns [Theunissen2001]. This response characteristic of neurons can be described by the so-called STRF. In this paper, the localized spectro-temporal features (LSTF) are analyzed further with ... This capability of DNNs in learning task-oriented features can be utilized to learn speaker-discriminative features as well. The effect of bio-inspired spectro-temporal processing for automatic speech recognition (ASR) is analyzed for two different tasks, with a focus on the robustness of spectro-temporal Gabor features in ... For text-independent speaker identification, a prominent combination is to use Gaussian mixture models (GMMs) for classification while relying on mel-frequency cepstral coefficients (MFCCs) as features. Speaker verification APIs serve as an intelligent tool to help verify speakers using both their voice and speech passphrases. Similar techniques are widely used in the visual domain.
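The likelihood-based identification mentioned above (GMMs scored on MFCC frames) can be illustrated with a deliberately reduced sketch. Everything here is an assumption made for illustration: each enrolled speaker is modelled by a single diagonal-covariance Gaussian (a one-component stand-in for a GMM, with no EM training), the 2-D vectors are made-up numbers standing in for MFCC frames, and the function and speaker names are hypothetical.

```python
import math

def fit_diag_gaussian(frames, var_floor=1e-3):
    """Fit a mean and a floored variance for each feature dimension."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
           for d in range(dim)]
    return mean, var

def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood under a diagonal Gaussian."""
    mean, var = model
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total -= 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def identify(frames, models):
    """Return the best-scoring enrolled speaker and the score of every speaker."""
    scores = {name: avg_log_likelihood(frames, m) for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy 2-D vectors standing in for MFCC frames (made-up numbers).
enrollment = {
    "speaker_a": [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]],
    "speaker_b": [[4.0, -1.0], [4.2, -0.8], [3.8, -1.2], [4.1, -1.1]],
}
models = {name: fit_diag_gaussian(frames) for name, frames in enrollment.items()}
best, scores = identify([[1.05, 1.95], [0.95, 2.05]], models)
```

A full system would replace the single Gaussian with a multi-component mixture trained by expectation maximization and would compare the best score against a threshold before reporting a match.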
Author affiliation for Lei, Meyer, and Mirghafori: International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704, USA. The four spectro-temporal feature streams are each fed into a multilayer ... Robust speech recognition based on spectro-temporal features. Speaker identification APIs allow you to identify who is speaking based on their voice, supporting scenarios such as conversation transcription. Speaker emotion recognition based on speech features. Therefore, the purpose of this study is to assess the integration of sparse speech as a function of listener age, where the speech snippets are variously isolated in both the time and frequency domains, as well as in ear of presentation. The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization and decision trees. When speaker recognition is used for surveillance applications, or in general when the subject is not aware of it, the common privacy concerns of identifying unaware subjects apply.
Speaker recognition is a pattern recognition problem. Spectro-temporal power spectrum features for noise-robust ASR. A summary of features from the viewpoint of their physical interpretation. In this work, I have concentrated on MFCCs and LPCs. In addition to measuring utterance and vowel durations, this study quantified the degree of coarticulation by calculating the absolute formant frequency changes in ... This technique is based on applying a simple 2D filter to the speech spectrogram to highlight the movements of spectral peaks. With the STRF, a signal is expressed by rate (in Hz) and scale (in cycles/octave).
Localized spectro-temporal Gabor features for automatic speech recognition: the STRF of cortical neurons and early auditory features derived in psychoacoustic experiments can be approximated, although somewhat simplified ... The spectro-temporal envelope S_dB(t, x), in dB, of each moving ripple is defined by (1) S_dB(t, x) = (M/2) sin(2π ... The features are derived from a long-term spectro-temporal representation of speech. Feature extraction: choosing which features to extract from speech is the most significant part of speaker recognition. Signal details such as formant transitions and energy modulations contain useful speaker-specific information. Experimental results with the Berlin emotional speech database show that the proposed ...
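The moving-ripple envelope above is cut off in the source, so the following is only a sketch under stated assumptions: it assumes the sinusoid's argument is 2π(ωt + Ωx) plus an optional phase, with ω the drifting velocity in Hz, Ω the ripple density in cycles/octave, x the position on the log-frequency axis in octaves, and M the peak-to-peak modulation depth in dB. The sign convention for drift direction and all function and parameter names are hypothetical.

```python
import math

def ripple_envelope_db(t, x, omega_hz, density_cyc_per_oct, depth_db, phase=0.0):
    """Assumed envelope: (M / 2) * sin(2*pi*(omega*t + Omega*x) + phase), in dB."""
    arg = 2.0 * math.pi * (omega_hz * t + density_cyc_per_oct * x) + phase
    return 0.5 * depth_db * math.sin(arg)

def ripple_profile(t, n_channels, n_octaves, omega_hz, density_cyc_per_oct, depth_db):
    """Sample the envelope at time t on n_channels points spread over n_octaves."""
    step = n_octaves / n_channels
    return [ripple_envelope_db(t, i * step, omega_hz, density_cyc_per_oct, depth_db)
            for i in range(n_channels)]

# Example: a 4 Hz, 1 cycle/octave ripple with 20 dB peak-to-peak depth.
profile_start = ripple_profile(0.0, 32, 4.0, 4.0, 1.0, 20.0)
profile_one_period_later = ripple_profile(0.25, 32, 4.0, 4.0, 1.0, 20.0)
```

Under this assumed form the profile repeats every 1/ω seconds in time and every 1/Ω octaves along the frequency axis, which is what makes ω and Ω natural axes for a modulation transfer function.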
Spectro-temporal features: it is very much a rational assumption that the spectro-temporal ... Two-dimensional Gabor filter functions are applied to a spectro-temporal representation formed by columns of primary feature vectors. Short-term spectral features, as the name suggests, are computed from short frames of about 20-30 ms. In (Kleinschmidt, 2002a), the usage of 2-dimensional Gabor ... Speaker recognition lecture outline: introduction; measurement of speaker characteristics; construction of speaker models; decision and performance; applications. This lecture is based on Rosenberg et al. Original speaker recognition systems used the average output of several analog filters to perform matching, often with the aid of humans in the loop. New features to improve speaker recognition efficiency. This concept of spectro-temporal modulation decomposition has inspired many approaches in various engineering topics, such as using spectro-temporal modulation features for speaker recognition [12], robust speech recognition [18], voice activity detection [10], and sound ...
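Applying a two-dimensional Gabor filter to a spectro-temporal representation can be sketched as follows. The source does not give the filter definition, so the form used here (a Gaussian envelope multiplied by a complex sinusoid), the kernel size, the modulation frequencies and the function names are all assumptions chosen for illustration. The "spectrogram" is a synthetic cosine ripple, not speech.

```python
import cmath
import math

def gabor_kernel(n_freq, n_time, omega_k, omega_n, sigma_k, sigma_n):
    """Complex 2D Gabor kernel: Gaussian envelope times a complex sinusoid.

    omega_k and omega_n are modulation frequencies in radians per
    frequency channel and per time frame, respectively.
    """
    k0, n0 = n_freq // 2, n_time // 2
    kernel = []
    for k in range(n_freq):
        row = []
        for n in range(n_time):
            dk, dn = k - k0, n - n0
            envelope = math.exp(-0.5 * ((dk / sigma_k) ** 2 + (dn / sigma_n) ** 2))
            carrier = cmath.exp(1j * (omega_k * dk + omega_n * dn))
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def filter_at(spec, kernel, k_c, n_c):
    """Correlate the kernel with spec, centred at channel k_c and frame n_c."""
    k0, n0 = len(kernel) // 2, len(kernel[0]) // 2
    acc = 0j
    for i, row in enumerate(kernel):
        for j, w in enumerate(row):
            k, n = k_c + i - k0, n_c + j - n0
            if 0 <= k < len(spec) and 0 <= n < len(spec[0]):
                acc += spec[k][n] * w.conjugate()
    return acc

omega_k = omega_n = math.pi / 4  # one ripple period every 8 channels / 8 frames
kernel = gabor_kernel(15, 15, omega_k, omega_n, 3.0, 3.0)
# Synthetic ripples drifting in the kernel's direction and in the opposite one.
same = [[math.cos(omega_k * k + omega_n * n) for n in range(32)] for k in range(32)]
opposite = [[math.cos(omega_k * k - omega_n * n) for n in range(32)] for k in range(32)]
matched = abs(filter_at(same, kernel, 16, 16))
mismatched = abs(filter_at(opposite, kernel, 16, 16))
```

The magnitude of the complex response is large for the ripple whose orientation matches the kernel and small for the opposite drift direction, which is the directional selectivity that makes such filters sensitive to patterns like formant transitions.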
Speaker recognition is the identification of a person from characteristics of voices. Mathur S, Choudhary SK, Vyas JM 20 Speaker recognition system and its forensic implications. A recent study shows that this is possible, at least on text-dependent tasks [8]. Speaker recognition has been studied actively for several decades. Speaker recognition, or more broadly speech recognition, has been an active area of research for the past two decades. Decomposing spectro-temporal sounds by NMF: audio signals sampled in the time domain are transformed into spectrograms, which represent time-dependent spectral energies of sounds. The experiments with the HTK recognizer were performed with different SNRs (matched training and testing).
The API can be used to determine the identity of an unknown speaker. In order to improve the performance of automatic speech recognition (ASR) systems, a number of different ... The hypothesis was that the integration of speech fragments distributed over frequency, time, and ear of presentation is reduced in older listeners, even for those with good audiometric hearing. The MTFs exhibit a low-pass function with respect to both dimensions, with 50% bandwidths of about 16 Hz and 2 cycles/octave. Robust speech recognition based on spectro-temporal processing.
The purpose of this study was to determine the effects of age on the spectro-temporal integration of speech. Multi-stream spectro-temporal features for robust speech recognition. Sherry Y.
Spectral-temporal receptive fields and MFCC balanced feature extraction for noisy speech recognition. Jiaching Wang (1), Changhong Lin (1), Enting Chen (2), and Paochi Chang (2). (1) Department of Computer Science and Information Engineering, National Central University, Jhongli, Taiwan; (2) Department of Communication Engineering, National Central University. Speaker verification (also called speaker authentication) contrasts with identification, and speaker recognition differs from speaker diarisation. Learning non-negative features of spectro-temporal sounds. These physiologically and psychoacoustically motivated features employ spectro-temporal information inherent to the speech signal.
The rate and scale are used to specify the temporal response and ...
The Gabor wavelet is the most common of these directional ... Now, the Gabor features are modified to a complex filter and based on mel-spectra, which is the standard first processing stage for most types of features mentioned above. Detection thresholds for spectral and temporal modulations are measured using broadband spectra with sinusoidally rippled profiles that drift up or down the log-frequency axis at constant velocities. A novel type of feature extraction is introduced to be used as a front end for automatic speech recognition (ASR). It is investigated whether the use of Gabor features may increase the performance of more sophisticated state-of-the-art ASR systems. Introduction: feature extraction is the key part of the front end.