Speech processing applications in medicine: Study of spectrum and noise as indicators of laryngeal pathology and clinical depression

Kumara, Shama (2006) Speech processing applications in medicine: Study of spectrum and noise as indicators of laryngeal pathology and clinical depression. PhD thesis, Manipal Institute of Technology, Manipal.

th6.pdf - Submitted Version
Restricted to Registered users only


Speech, one of the important means of communication, has been the subject of intense study and research for many decades. In particular, acoustic analysis of the speech signal has emerged as a potential technique for the evaluation of voice and for the early detection and diagnosis of laryngeal pathology. This noninvasive technique has proved to be an effective tool for clinicians in the prediagnosis of diseases of the larynx and could be the quickest low-cost indicator of possible voice malfunctions. Acoustic processing of speech also plays an important role in the analysis and evaluation of emotional disorders, particularly clinical depression. This thesis is an exploratory study of speech processing applications in medicine. Spectral and noise features of the speech signal are proposed as indicators of laryngeal pathology and clinical depression. Harmonic to noise ratio (HNR) and the critical band energy spectrum are used as indicators of laryngeal pathology, whereas vocal jitter, formant frequencies, formant bandwidths and power distributions are considered for the analysis of depressed speech. The signal processing methods and algorithms used to extract these features are presented in detail. The extracted features are used for the differential classification of pathological and normal voices. The main classification technique is the k-means nearest neighbor classifier, and its performance is tested on a prelabeled database consisting of normal and pathological voice samples. A multilayer perceptron (MLP) neural network based classifier using the critical band energy spectrum is also designed and tested for the classification. These methods could be used as tools to supplement the perceptual evaluation of speech for the detection of suspected laryngeal pathologies. The thesis also presents the results of the discriminant analysis of the acoustic features for differentiating clinical depression from normal.
The efficacy of these parameters in discriminating depressed speech from normal is tested using a newly constructed database of speech samples. A brief review covering the anatomy and physiology of speech production, the different types of speech sound and some characteristics of pathological and depressed speech is given in chapter 1. One of the aims of the present work is to investigate the performance of acoustic features in discriminating between normal and pathological voices. Therefore, a concise survey of the existing acoustic analysis techniques for characterizing pathological voices is included in chapter 1. A review of the existing literature on the acoustic properties of speech in depression is also included in this chapter. Pathological voices usually contain higher aperiodic (noise) components, mainly due to the malfunctioning of the vocal folds. The relative noise level is therefore one of the important acoustic properties that describe pathological voices. The noise in pathological voice is mainly quantified in terms of HNR and normalized noise energy (NNE). In the present work, HNR is considered as one of the acoustic indicators of laryngeal pathology. The estimation of HNR, the design of a simple k-means nearest neighbor classifier and the differential classification of normal and pathological voices are considered in chapter 2. Using the k-means nearest neighbor classifier, it is shown that HNR at four different frequency bands is very effective in screening pathological voices. The harmonic and noise energies in voiced speech are estimated by decomposing the speech signal into harmonic and noise components using a signal extrapolation algorithm. The fundamental frequency of voicing is accurately estimated by identifying the glottal closure instant, or the location of strongest excitation, in each glottal cycle using a wavelet band pass filtering function. These algorithms are described in chapter 2.
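As a rough illustration of the band-wise HNR feature described above, the sketch below computes HNR in decibels for four frequency bands, assuming the voiced speech has already been separated into harmonic and noise components. It does not implement the signal extrapolation algorithm of the thesis. The function name band_hnr_db, the band edges, the small eps guard and the synthetic signals are all hypothetical choices made for this example.

```python
import numpy as np

def band_hnr_db(harmonic, noise, fs, band_edges_hz, eps=1e-12):
    """HNR in dB per band, given already separated harmonic and noise signals.

    Assumption: energy in a band is the sum of squared FFT magnitudes whose
    frequencies fall inside [low, high). eps only guards against log(0).
    """
    freqs = np.fft.rfftfreq(len(harmonic), d=1.0 / fs)
    h_power = np.abs(np.fft.rfft(harmonic)) ** 2
    n_power = np.abs(np.fft.rfft(noise)) ** 2
    hnr = []
    for low, high in band_edges_hz:
        mask = (freqs >= low) & (freqs < high)
        e_h = h_power[mask].sum()
        e_n = n_power[mask].sum()
        hnr.append(10.0 * np.log10((e_h + eps) / (e_n + eps)))
    return np.array(hnr)

# Synthetic stand-ins for the two components (not data from the thesis).
fs = 8000
t = np.arange(fs) / fs  # one second of samples
harmonic = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
noise = 0.05 * np.random.default_rng(0).standard_normal(fs)

# Hypothetical band edges in Hz; the abstract does not state the actual ones.
bands = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]
hnr = band_hnr_db(harmonic, noise, fs, bands)
print(hnr)
```

The resulting four-element vector plays the role of the HNR feature vector that the classifier compares against class centroids. In this synthetic case only the lowest band contains harmonic energy, so only that band has a high HNR.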
The k-means nearest neighbor classifier assigns a given speech sample to the pathological or normal class by comparing the HNR vector of the test sample with a class-specific pattern (centroid vector). The centroid vectors are obtained during the training phase of the classifier as the mean of the HNR vectors of a fixed number of samples belonging to a particular class. This classifier is easy to implement and has a low computational cost. The performance of the classifier on speech samples from a prelabeled database indicates that HNR at four frequency bands is a good indicator of laryngeal pathology. An overall classification accuracy of 94.28% is reported using 53 normal and 163 pathological voice signals. Extending the idea of using noise components in voiced speech, we propose the critical band energy spectrum as another acoustic indicator of laryngeal pathology. The differential classification of normal and pathological voices using this parameter is described in chapter 3 of the thesis. Instead of using noise energy relative to harmonic energy, the effect of noise on the spectral distribution is used to discriminate pathological voice from normal. This extends an idea reported in an earlier work, where the effect of added noise on the spectral distribution was used to distinguish clean speech from noisy speech. The approach works well for pathological voice, as pathological speech is assumed to have additional noise components. Normalized energies at 21 critical frequency bands form the features for the classification between normal and pathological voices. A bank of Butterworth band pass filters is used to filter the speech and to estimate the energy vector. The k-means nearest neighbor classifier tested on the same data set gave an overall accuracy of 92.38% in discriminating pathological voice from normal.
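The centroid comparison described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the Euclidean distance, the function names train_centroids and classify, and the four-element feature values are assumptions made for this example.

```python
import numpy as np

def train_centroids(features, labels):
    """Return one centroid per class: the mean of that class's feature vectors."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return {str(c): features[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(x, centroids):
    """Assign x to the class whose centroid is closest (Euclidean distance assumed)."""
    x = np.asarray(x, dtype=float)
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Made-up four-band feature vectors in dB (not data from the thesis).
train_x = [
    [20.0, 18.0, 15.0, 12.0],
    [22.0, 19.0, 16.0, 13.0],
    [8.0, 6.0, 4.0, 2.0],
    [10.0, 7.0, 5.0, 3.0],
]
train_y = ["normal", "normal", "pathological", "pathological"]

centroids = train_centroids(train_x, train_y)
pred_a = classify([19.0, 17.0, 14.0, 11.0], centroids)
pred_b = classify([7.0, 5.0, 4.0, 1.0], centroids)
print(pred_a, pred_b)
```

The same two functions apply unchanged to a 21-element critical band energy vector, since only the length of the feature vector differs.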
Though the critical band result is slightly poorer than that of the HNR based classification, the method is computationally less expensive. The critical bands used have center frequencies and bandwidths that roughly correspond to human auditory neurons. Thus the proposed automated analysis mimics the perceptual evaluation of pathological voice. Instead of quantifying a voice sample by a single critical band energy vector, we also used short term critical band energy vectors for the classification. Each speech sample is segmented into 20 millisecond frames, and the critical band energy vectors of all these frames are used as features to train a multilayer perceptron (MLP) network. This improved the overall classification accuracy to 93.39%. In the next chapter (chapter 4), the acoustic analysis of depressed speech is presented. Depressed speech has been described as dull, monotone, lifeless and metallic. These perceptual qualities have been associated with vocal jitter, formant structure and power distributions. We examined these acoustic properties and studied whether any of them discriminate voices recorded under clinical depression from normal voices. The statistical analysis indicated that all the parameters studied differed significantly between the depressed and normal groups. The vocal jitter, which is the perturbation in fundamental frequency, is estimated by tracking the pitch variation in voiced speech segments taken from continuous running speech. A Dyadic Wavelet Transform (DWT) based algorithm was used to detect the voiced segments. The formant frequencies and bandwidths were estimated using Linear Predictive Coding (LPC) analysis, while the power distribution in four frequency bands ranging from 0 to 2 kHz was obtained using the Welch method of power spectral density estimation. The efficacy of all these speech parameters in differentiating depressed speech from normal is tested using our own database of speech samples.
The database was developed from speech samples recorded from normal male individuals and from male individuals diagnosed with major depression, all belonging to the local Kannada-speaking population (Kannada is one of the south Indian languages). For this data set, the vocal jitter and the features describing the power spectral distribution emerged as good indicators of depression. The thesis concludes by summarizing the results of the analysis and suggesting areas for further investigation.
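Vocal jitter, described in the abstract as the perturbation in fundamental frequency, can be illustrated with one common definition: the mean absolute difference between consecutive pitch periods, relative to the mean period. The abstract does not state which jitter formula the thesis uses, so the definition below, the function name local_jitter_percent and the period values are assumptions made for this sketch.

```python
import numpy as np

def local_jitter_percent(periods_s):
    """Cycle-to-cycle jitter in percent from a sequence of pitch periods (seconds).

    Assumed definition: mean |T[i+1] - T[i]| divided by the mean period.
    """
    periods = np.asarray(periods_s, dtype=float)
    if periods.size < 2:
        raise ValueError("need at least two pitch periods")
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# Made-up pitch periods (not data from the thesis).
steady = local_jitter_percent([0.010, 0.010, 0.010, 0.010, 0.010])
perturbed = local_jitter_percent([0.010, 0.011, 0.010, 0.011])
print(steady, perturbed)
```

A perfectly periodic voice gives zero jitter under this definition, while alternating periods of 10 ms and 11 ms give a value of roughly 9.5 percent.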

Item Type: Thesis (Phd. Thesis)
Subjects: Engineering > MIT Manipal > Electronics and Communication
Depositing User: MIT Library
Date Deposited: 02 Jun 2014 05:57
Last Modified: 07 Nov 2014 09:38
URI: http://eprints.manipal.edu/id/eprint/139452
