Sarcasm detection presents unique challenges in speech technology, particularly for individuals with disorders that affect pitch perception or those lacking contextual auditory cues. While previous research has established the significance of integrating textual, audio and visual data in sarcasm detection, these studies overlook the interactions between modalities. We propose an approach that combines audio, textual, sentiment and emotion data to enhance sarcasm detection. This involves augmenting sarcastic audio with corresponding text obtained through Automatic Speech Recognition (ASR), supplemented with information from emotion recognition and sentiment analysis. Our methodology leverages the strengths of each modality: emotion recognition algorithms analyze the audio data for affective cues, while sentiment analysis processes the text generated by ASR. The integration of these modalities aims to compensate for limitations in current multimodal approaches by providing complementary cues essential for accurate sarcasm interpretation. Evaluated on only the audio data of the MUStARD++ dataset, our approach surpasses the state-of-the-art model by 4.79% in F1-score. Our approach improves sarcasm detection in the audio domain, which is especially beneficial to those with auditory processing challenges. This research highlights the potential of multimodal data fusion for capturing the subtleties of speech perception and understanding, thus contributing to the advancement of speech technology applications.
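
The abstract does not state how the four cue types are combined, so the following is only a minimal sketch of one common option, feature-level fusion by concatenation. The class name `UtteranceFeatures`, the function `fuse`, the vector sizes and all numeric values are hypothetical placeholders; in practice each vector would come from a separate acoustic encoder, ASR plus text encoder, emotion recognizer and sentiment analyzer, none of which are specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UtteranceFeatures:
    """Per-utterance cue vectors (all names and sizes are illustrative assumptions)."""
    audio: List[float]      # acoustic embedding of the utterance
    text: List[float]       # embedding of the ASR transcript
    emotion: List[float]    # emotion scores predicted from the audio
    sentiment: List[float]  # sentiment scores predicted from the transcript


def fuse(features: UtteranceFeatures) -> List[float]:
    """Concatenate the four cue vectors into a single classifier input."""
    return features.audio + features.text + features.emotion + features.sentiment


# Placeholder values standing in for real model outputs.
example = UtteranceFeatures(
    audio=[0.12, -0.40, 0.88],
    text=[0.05, 0.31],
    emotion=[0.7, 0.2, 0.1],
    sentiment=[0.15, 0.85],
)
fused = fuse(example)
print(len(fused))
```

The fused vector would then be passed to a binary sarcasm classifier. Concatenation is the simplest choice and does not by itself model interactions between modalities; attention-based or gated fusion are alternatives better suited to that goal.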
