Sarcasm detection presents unique challenges in speech technology, particularly for individuals with disorders that affect pitch perception or those lacking contextual auditory cues. While previous research [1, 2] has established the significance of pitch variation in sarcasm detection, these studies have primarily focused on singular modalities, often overlooking the potential synergies of integrating multimodal data. We propose an approach that synergizes auditory, textual, and emoticon data to enhance sarcasm detection. This involves augmenting sarcastic audio data with corresponding text using Automatic Speech Recognition (ASR), supplemented with emoticons based on emotion recognition and sentiment analysis. Emotional cues from multi-modal data are mapped to emoticons. Our methodology leverages the strengths of each modality: emotion recognition algorithms analyze the audio data for affective cues, while sentiment analysis processes the text generated from ASR. The integration of these modalities aims to compensate for limitations in pitch perception by providing complementary cues essential for accurate sarcasm interpretation. Our approach is expected to significantly improve sarcasm detection, especially for those with auditory processing challenges. This research highlights the potential of multimodal data fusion in enhancing the subtleties of speech perception and understanding, thus contributing to the advancement of speech technology applications.