In human-computer interaction, it is crucial for agents to understand human emotions in order to respond appropriately. Unraveling the causes of those emotions is even more challenging. A new task (shown in Figure 1) named Multimodal Emotion-Cause Pair Extraction in Conversations (MECPEC) requires both recognizing emotions and identifying their causal expressions. Built on the multimodal conversational emotion cause dataset ECF, MECPEC comprises two subtasks: cause span extraction (Subtask 1) and emotion-cause pair extraction at the utterance level given the target emotion (Subtask 2).
Figure 1. An example of the MECPEC task and the annotated dataset.
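To make the task concrete, the snippet below sketches how one annotated conversation and its emotion-cause pairs might be represented. It is illustrative only: the field names and example utterances are our own assumptions, not the official ECF schema.

```python
# Illustrative only: a possible representation of one annotated conversation.
conversation = [
    {"id": 1, "speaker": "A", "text": "Hey, I got the job!",  "emotion": "joy"},
    {"id": 2, "speaker": "B", "text": "That's great news!",   "emotion": "joy"},
    {"id": 3, "speaker": "A", "text": "But my cat just died.", "emotion": "sadness"},
]

# Emotion-cause pairs as (emotion utterance id, cause utterance id).
# Subtask 2 predicts pairs at the utterance level; Subtask 1 additionally
# localizes a textual cause span inside the cause utterance.
pairs_subtask2 = [(1, 1), (2, 1), (3, 3)]
pairs_subtask1 = [(1, 1, "I got the job"), (2, 1, "I got the job"), (3, 3, "my cat just died")]
```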
In this study, we propose a multi-stage framework that recognizes emotions and extracts emotion-cause pairs given the target emotion. In the first stage, the Llama-2-based InstructERC [1] is used to predict the emotion category of each utterance in a conversation. After emotion recognition, a two-stream attention model is employed to extract emotion-cause pairs given the target emotion for Subtask 2, while MuTEC [2] is employed to extract the causal span for Subtask 1. Our approach achieved first place in both subtasks of the competition.
An overview of the proposed architecture is shown in Figure 2. InstructERC extracts the emotion of each utterance. TSAM [3] is a two-stream attention model used to extract causal pairs given the predicted emotion utterances. MuTEC is an end-to-end network designed to extract the causal span on top of the extracted causal pairs.
Figure 2. Overview of the proposed model framework.
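As a minimal sketch of this multi-stage framework, the snippet below wires the three stages together as generic callables. The wrapper names are placeholders of our own, not the authors' code.

```python
# A minimal sketch of the pipeline, assuming each stage is wrapped as a callable.
def run_pipeline(conversation, erc_model, pair_extractor, span_extractor):
    # Stage 1: emotion recognition for every utterance (InstructERC in the paper).
    emotions = erc_model(conversation)

    # Stage 2 (Subtask 2): emotion-cause pair extraction given the predicted
    # emotion utterances (TSAM in the paper).
    pairs = pair_extractor(conversation, emotions)

    # Stage 3 (Subtask 1): textual cause span extraction for each pair
    # (MuTEC in the paper).
    spans = span_extractor(conversation, pairs)
    return emotions, pairs, spans
```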
We follow InstructERC, which reformulates the ERC task from a discriminative framework into a generative one, and design a prompt template comprising a job description, a historical utterance window, a label set, and an emotional-domain retrieval module. To better distinguish similar emotions, we design a hierarchical classification label set, shown in Figure 3. The emotion labels in the dataset can be split into three categories: neutral, positive, and negative, where the positive set consists of surprise and joy, while the negative set includes fear, sadness, disgust, and anger.
Figure 3. The hierarchical structure of the emotion labels.
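Written out as a plain mapping, the hierarchy of Figure 3 looks as follows (a sketch based on the label groups listed above, not code from the paper):

```python
# Coarse-to-fine emotion label hierarchy used for the auxiliary recognition tasks.
EMOTION_HIERARCHY = {
    "neutral":  ["neutral"],
    "positive": ["surprise", "joy"],
    "negative": ["fear", "sadness", "disgust", "anger"],
}
```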
We add three auxiliary tasks to the training data: sub-label recognition, positive recognition, and negative recognition. The instruct template is depicted in Figure 4. For the sub-label recognition (SR), positive recognition (PR), and negative recognition (NR) tasks, we replace the label statement with the corresponding label set. Visual data also plays an essential role in ERC. For video clips, we use LLaVA to generate descriptions of the background, the speaker's movement, and the speaker's personal state, and we add these three descriptions to the instruct template.
Figure 4. The schematic of the instruct template for ERC.
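A hedged sketch of how such an instruct prompt could be assembled from the components named above (job description, historical utterance window, label statement, and the LLaVA-generated descriptions) is given below. The exact wording in Figure 4 may differ; the function and its arguments are only illustrative.

```python
def build_erc_prompt(history, target_utterance, label_set,
                     background="", movement="", personal_state=""):
    """Assemble an ERC instruction prompt (illustrative paraphrase, not the
    paper's exact template)."""
    labels = ", ".join(label_set)
    context = "\n".join(f"{u['speaker']}: {u['text']}" for u in history)
    return (
        "Now you are an expert in sentiment analysis.\n"              # job description
        f"The candidate emotion labels are: [{labels}].\n"            # label statement
        f"Conversation history:\n{context}\n"                         # historical window
        f"Background: {background}\n"                                 # LLaVA descriptions
        f"Speaker movement: {movement}\n"
        f"Personal state: {personal_state}\n"
        f"Please select the emotion label of the utterance: {target_utterance}"
    )
```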
In our pipeline framework, for Subtask 2, we first predict the emotion of each utterance and then extract the causal pairs given the emotional utterances in a conversation. Causal pair extraction is typically modelled as the causal emotion entailment (CEE) task. In our system, we employ TSAM as the causal pair extractor. TSAM comprises three main modules: a Speaker Attention Network (SAN), an Emotion Attention Network (EAN), and an Interaction Network (IN). The EAN and SAN integrate emotion and speaker information simultaneously, and the subsequent interaction module exchanges pertinent information between the EAN and SAN through a mutual BiAffine transformation [4].
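For illustration, the sketch below shows one way a mutual BiAffine interaction between two utterance-level streams (e.g. the EAN and SAN outputs) can be implemented, following the biaffine attention of [4]. The dimensions and details are our own assumptions, not TSAM's exact code.

```python
import torch
import torch.nn as nn

class MutualBiAffine(nn.Module):
    """Sketch of a mutual BiAffine exchange between two streams of utterance
    representations; not TSAM's exact implementation."""

    def __init__(self, dim):
        super().__init__()
        self.W_es = nn.Parameter(torch.randn(dim, dim) * 0.02)  # EAN -> SAN scores
        self.W_se = nn.Parameter(torch.randn(dim, dim) * 0.02)  # SAN -> EAN scores

    def forward(self, H_e, H_s):
        # H_e, H_s: (batch, num_utterances, dim) hidden states of the two streams.
        A_es = torch.softmax(H_e @ self.W_es @ H_s.transpose(1, 2), dim=-1)
        A_se = torch.softmax(H_s @ self.W_se @ H_e.transpose(1, 2), dim=-1)
        # Each stream is updated with information attended from the other stream.
        H_e_new = A_es @ H_s
        H_s_new = A_se @ H_e
        return H_e_new, H_s_new
```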
Emotion cause span extraction aims to predict the start and end positions of the cause span within the causal utterance of a conversation. Typically, one could use a pipeline that first predicts the emotion and then predicts the cause span. Instead, we follow MuTEC and use an end-to-end framework trained in a joint multi-task learning manner to extract the causal span. During training, the input comprises the target utterance Ut, the candidate cause utterance Ui, and the historical context. MuTEC employs a pre-trained language model (PLM) to extract context representations. For emotion recognition, which serves as an auxiliary task, it uses a classification head on top of the PLM. The end position is predicted by a head applied to the concatenation of the representation at the given start index with the sequence output of the PLM; during training, the gold start index is used as the start index.
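The snippet below sketches the prediction heads described above under our reading of MuTEC: a start head over the PLM sequence output, an end head over the concatenation of the start-position representation with the sequence output, and an auxiliary emotion head. Module names and dimensions are assumptions, not the original code.

```python
import torch
import torch.nn as nn

class SpanAndEmotionHeads(nn.Module):
    """Sketch of start/end span heads plus an auxiliary emotion head."""

    def __init__(self, hidden_size, num_emotions):
        super().__init__()
        self.start_head = nn.Linear(hidden_size, 1)
        self.end_head = nn.Linear(2 * hidden_size, 1)
        self.emotion_head = nn.Linear(hidden_size, num_emotions)

    def forward(self, sequence_output, pooled_output, gold_start_index):
        # sequence_output: (batch, seq_len, hidden); pooled_output: (batch, hidden)
        start_logits = self.start_head(sequence_output).squeeze(-1)   # (batch, seq_len)

        # Representation at the (gold, during training) start position.
        batch_idx = torch.arange(sequence_output.size(0))
        start_repr = sequence_output[batch_idx, gold_start_index]     # (batch, hidden)

        # End position scored from [token repr ; start repr] at every position.
        start_expanded = start_repr.unsqueeze(1).expand_as(sequence_output)
        end_logits = self.end_head(
            torch.cat([sequence_output, start_expanded], dim=-1)).squeeze(-1)

        emotion_logits = self.emotion_head(pooled_output)             # auxiliary task
        return start_logits, end_logits, emotion_logits
```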
We use the weighted average F1 score and accuracy to evaluate model performance. Note that, according to the competition rules, neutral utterances are removed when computing the F1 score and accuracy. The ERC results on the test set are shown in Table 1. The best weighted average F1 score is 58.64, achieved by Llama-2-13B with historical clip descriptions.
Table 1. Results of the ERC task on the test set without neutral utterances.
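For reference, this metric can be approximated with scikit-learn as below, assuming neutral utterances are simply filtered out before scoring; the official scorer may differ in details.

```python
from sklearn.metrics import accuracy_score, f1_score

def erc_scores(y_true, y_pred, neutral_label="neutral"):
    """Weighted-average F1 and accuracy over the non-neutral emotions
    (a sketch of the protocol described above)."""
    keep = [i for i, y in enumerate(y_true) if y != neutral_label]
    y_true_kept = [y_true[i] for i in keep]
    y_pred_kept = [y_pred[i] for i in keep]
    return (
        f1_score(y_true_kept, y_pred_kept, average="weighted"),
        accuracy_score(y_true_kept, y_pred_kept),
    )
```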
We use an end-to-end framework for cause span extraction and achieve a final weighted average proportional F1 score of 32.23 on the official evaluation dataset, as shown in Table 2. Our result surpasses the second-place participant's score of 26.40 by about 6 points. Furthermore, our submission achieved the highest scores on all other official evaluation metrics, validating the effectiveness of our approach for Subtask 1.
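The core idea behind the proportional variant of the metric is partial credit for overlapping spans. The helper below sketches this for a single predicted/gold span pair (inclusive token indices); the official scorer additionally aggregates per emotion category, so this is a simplification under our own assumptions.

```python
def proportional_pr(pred_span, gold_span):
    """Partial precision/recall credit for one predicted vs. gold cause span
    (illustrative sketch, not the official scorer)."""
    (ps, pe), (gs, ge) = pred_span, gold_span        # inclusive token indices
    overlap = max(0, min(pe, ge) - max(ps, gs) + 1)
    precision = overlap / (pe - ps + 1)              # fraction of prediction that is correct
    recall = overlap / (ge - gs + 1)                 # fraction of gold span that is covered
    return precision, recall
```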
Table 2. Results of our models for the causal emotion entailment subtask.
As shown in Table 2, after incorporating the MIN, our positive F1 score improves by +1.2. With the introduction of emotional multi-task learning as an auxiliary task, the result improves by a further +0.4. Finally, model ensembling yields an additional improvement of approximately +1.1 on the official final evaluation dataset.
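The ensembling recipe is not spelled out above; the snippet below shows one common scheme (averaging class probabilities across several trained models) purely for illustration, as an assumption rather than the exact method used.

```python
import numpy as np

def ensemble_by_probability_averaging(prob_list):
    """Average per-class probabilities from several models and take the argmax.
    prob_list: list of arrays of shape (num_examples, num_classes)."""
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)
    return avg.argmax(axis=-1)
```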
In this paper, we propose a joint pipeline framework for Subtask 1 and Subtask 2. First, we use the Llama-2-based InstructERC model to extract the emotion of each utterance in a conversation. Next, we employ a two-stream attention model to identify causal pairs based on the predicted emotional utterances. Finally, we adopt an end-to-end framework with multi-task learning to extract causal spans within a conversation. Our approach achieved first place in the competition, and its effectiveness is further confirmed by the ablation study. In future work, we plan to explore deeper integration of the audio and visual modalities to further enhance performance on this task.
Paper: https://arxiv.org/abs/2404.16905
[1] Shanglin Lei, Guanting Dong, Xiaoping Wang, Keheng Wang, and Sirui Wang. 2023. InstructERC: Reforming emotion recognition in conversation with a retrieval multi-task LLMs framework. arXiv preprint arXiv:2309.11911.
[2] Ashwani Bhat and Ashutosh Modi. 2023. Multi-task learning framework for extracting emotion cause span and entailment in conversations. In Transfer Learning for Natural Language Processing Workshop, pages 33–51.
[3] Duzhen Zhang, Zhen Yang, Fandong Meng, Xiuyi Chen, and Jie Zhou. 2022. TSAM: A two-stream attention model for causal emotion entailment. arXiv preprint arXiv:2203.00819.
[4] Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.