Commonsense Question Answering Task (1/25 teams), Conversational Entailment Task (1/25 teams), Everyday Actions in Text Task (2/25 teams)

TL;DR

  • Applied transformer-based language models (BERT, ALBERT, RoBERTa, and ELECTRA) to all three tasks.
  • Solved the CQA task by representing the question jointly with each candidate answer, using pre-trained language models as the main encoder. The candidate answer corresponding to the highest-scored sentence is taken as the output.
  • Tackled the Conversational Entailment task by concatenating each full dialogue with the hypothesis, using pre-trained language models as the encoder with feed-forward classifiers on top.
  • Leveraged pre-trained models that are first fine-tuned on other textual entailment datasets to bring in external knowledge.

CommonsenseQA

  • When humans answer questions, they capitalize on their commonsense and background knowledge about spatial relations, cause and effect, scientific facts, and social conventions (Talmor et al., 2019). Commonsense Question Answering (CommonsenseQA) is a multiple-choice question answering dataset requiring different types of commonsense knowledge to predict the correct answer out of five natural-language choices.

CommonsenseQA

  • We experiment on this task with pre-trained models such as BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2019) as baselines. We solve the problem by representing the question jointly with each candidate answer and using pre-trained language models as the main encoder. Each sentence is scored from its sentence-level hidden vector, and the candidate answer corresponding to the highest-scored sentence is taken as the output. Moreover, we add knowledge selected by relation from ConceptNet and descriptions from Wiktionary to the input, both of which are important for improving commonsense reasoning: the knowledge graph contributes rich structural information, and the external entity descriptions provide contextual information about the graph entities. A minimal code sketch of this setup is shown after the results below.

  • Input schema:

    • If we can find a relation from the question concept to the answer concept in ConceptNet, the input schema is: question choice [SEP] q_concept: q_description [SEP] a_concept: a_description [SEP] relation triplet.
    • Otherwise, the input schema is: question choice [SEP] q_concept: q_description [SEP] a_concept: a_description.
  • Results on the CommonsenseQA validation set. Q&A means the question and answer are used as the model input; REL. indicates that the model additionally includes relations from ConceptNet in the input; DES. indicates that the model adds descriptions from Wiktionary to the input. The best metric is in boldface:

CommonsenseQA Result
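  • A minimal sketch of how the input schema above could be assembled and scored with a Hugging Face multiple-choice head, assuming the retrieved ConceptNet/Wiktionary fields are already available per candidate. The checkpoint name, field names (q_concept, q_desc, etc.), and helper functions are illustrative assumptions, not the exact implementation:

```python
# Illustrative sketch: build the
# "question choice [SEP] q_concept: q_description [SEP] a_concept: a_description [SEP] relation"
# inputs and score the five candidates with a multiple-choice head.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

MODEL_NAME = "roberta-large"  # BERT/ALBERT/ELECTRA checkpoints are drop-in replacements
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The multiple-choice head is randomly initialized here and would be
# fine-tuned on the CommonsenseQA training set before use.
model = AutoModelForMultipleChoice.from_pretrained(MODEL_NAME)

def build_pair(question, choice, q_concept, q_desc, a_concept, a_desc, relation=None):
    """Assemble one candidate following the input schema described above."""
    knowledge = f"{q_concept}: {q_desc} {tokenizer.sep_token} {a_concept}: {a_desc}"
    if relation is not None:  # a relation triplet was found in ConceptNet
        knowledge += f" {tokenizer.sep_token} {relation}"
    return f"{question} {choice}", knowledge

def predict(question, candidates):
    """candidates: list of dicts holding the choice text and retrieved knowledge."""
    firsts, seconds = zip(*(
        build_pair(question, c["choice"], c["q_concept"], c["q_desc"],
                   c["a_concept"], c["a_desc"], c.get("relation"))
        for c in candidates))
    enc = tokenizer(list(firsts), list(seconds), padding=True,
                    truncation=True, return_tensors="pt")
    # The multiple-choice head expects tensors of shape (batch, num_choices, seq_len).
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}
    logits = model(**enc).logits              # one sentence-level score per candidate
    return int(torch.argmax(logits, dim=-1))  # index of the highest-scored answer
```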

Conversation Entailment

  • Conversational Entailment is the task of determining whether a given short natural-language conversation discourse entails a hypothesis about its participants. In each dialogue, two participants discuss a topic of interest (e.g., sports activities, corporate culture, etc.). Each dialogue also comes with annotations such as syntactic structures, disfluency markers, and dialogue acts.

Conversation Entailment

  • Conversation entailment provides an intermediate step towards acquiring information about conversation participants. The task falls into the traditional formulation of textual entailment (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007), and pre-trained models have shown impressive performance on textual entailment. We therefore use pre-trained models with feed-forward classifiers to tackle this task. Specifically, we concatenate each full dialogue with the hypothesis, and inside the full dialogue we prefix each dialogue turn with its speaker information.

  • Input schema:

    • For a dialogue turn by speaker X, we convert it to “SpeakerX: dialogue turn”. The converted dialogue turns are concatenated to form the full dialogue. The final input schema is “[CLS] full dialogue [SEP] hypothesis [SEP]”.
  • With limited training data available, we find that leveraging existing large textual entailment benchmarks such as MultiNLI (Williams et al., 2018) and FEVER (Thorne et al., 2018) helps improve model performance on this entailment task. Moreover, optimizing the representation of speaker information further boosts prediction accuracy. A minimal code sketch of this setup is shown after the results below.

  • Performance of different models on Conversation Entailment. Mean and standard deviation of the metric over 5 cross-validation folds are reported. The best metric is in boldface:

Conversation Entailment Result
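  • A minimal sketch of this setup, assuming a checkpoint already fine-tuned on an external entailment dataset (e.g., roberta-large-mnli) as the starting point. The function names and the (speaker, utterance) tuple format are illustrative assumptions, not the exact implementation:

```python
# Illustrative sketch: prefix each turn with its speaker, concatenate the turns
# into the full dialogue, and classify the (dialogue, hypothesis) pair with a
# sequence-pair classifier that was first fine-tuned on an entailment benchmark.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large-mnli"  # assumed MNLI-fine-tuned starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def build_dialogue(turns):
    """turns: list of (speaker, utterance) pairs, e.g. [("A", "I like hiking."), ...]."""
    return " ".join(f"Speaker{speaker}: {utterance}" for speaker, utterance in turns)

def classify(turns, hypothesis):
    # The tokenizer adds the [CLS]/[SEP] (or <s>/</s>) markers itself, matching the
    # "[CLS] full dialogue [SEP] hypothesis [SEP]" schema described above.
    enc = tokenizer(build_dialogue(turns), hypothesis,
                    truncation=True, return_tensors="pt")
    logits = model(**enc).logits
    return model.config.id2label[int(torch.argmax(logits, dim=-1))]
```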

Everyday Actions in Text

  • Everyday Actions in Text (EAT) is a task in which, given a short natural-language story, the goal is to determine whether the story is physically plausible and, if it is implausible, which sentence is the breakpoint, i.e., the sentence after which the story stops making sense.

EAT

  • We model the two sub-tasks in EAT as one unified textual entailment task, so we can similarly adopt pre-trained models with linear classifiers to predict the entailment label. To construct the textual entailment input to the pre-trained models, given a story we take one of its sentences as the hypothesis and all preceding sentences as the context. If the hypothesis sentence is a breakpoint of the original story, the input should be classified as “contradiction” and the original story is implausible; otherwise, it should be classified as “entailment”.

  • Since the EAT training data is also limited, we leverage pre-trained models that are first fine-tuned on other textual entailment datasets to bring in external knowledge. A minimal sketch of the story-to-entailment conversion is shown after the results below.

  • Performance of different models on EAT. Mean and standard deviation of the metrics over 5 cross-validation folds are reported. The best metrics are in boldface:

EAT Result
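  • A minimal sketch of the story-to-entailment conversion described above. The function name, the dictionary fields, and the handling of the first sentence (used only as context here) are assumptions of this sketch rather than the exact implementation:

```python
# Illustrative sketch: convert one EAT story into textual-entailment pairs.
# `breakpoint_idx` is the 0-based index of the breakpoint sentence,
# or None if the story is plausible.
def story_to_entailment_examples(sentences, breakpoint_idx=None):
    examples = []
    for i in range(1, len(sentences)):      # the first sentence serves as context only
        context = " ".join(sentences[:i])   # all preceding sentences
        hypothesis = sentences[i]           # the sentence being checked
        label = "contradiction" if i == breakpoint_idx else "entailment"
        examples.append({"premise": context, "hypothesis": hypothesis, "label": label})
    return examples

# At prediction time, a story is deemed implausible if any pair is classified as
# "contradiction", and the first such sentence is reported as the breakpoint.
```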

Future Work

Commonsense QA
  • We plan to incorporate human-written explanations (Rajani et al., 2019) into the model to supply additional knowledge for reasoning.
Conversational Entailment
  • As shown in our current results, adding dialogue act information helps the model make correct predictions, but the improvement is not significant. We will therefore study how to better leverage dialogue act information with pre-trained models.
Everyday Actions in Text (EAT)
  • In our current setting, positive training samples far outnumber negative ones. We observe that a plausible story can become implausible when the order of its sentences is changed, so we plan to construct more negative training samples by swapping sentences within each story, as sketched below.
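  • A hypothetical sketch of this planned augmentation, under the assumption that swapping two adjacent sentences of a plausible story yields a usable negative sample with the later swapped position as the breakpoint; the function name and labeling convention are not from the current implementation:

```python
# Hypothetical augmentation sketch: swap two adjacent sentences of a plausible
# story to create an implausible one, treating the later swapped position as the
# new breakpoint (a labeling assumption of this sketch).
import random

def make_negative_sample(sentences, rng=random.Random(0)):
    i = rng.randrange(len(sentences) - 1)    # position of the first sentence to swap
    swapped = list(sentences)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped, i + 1                    # augmented story, assumed breakpoint index
```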

Related