Assignment 2
Question Answering
In this assignment, you are to propose and implement a QA (Question Answering) framework
using a sequence model and different NLP features. The QA framework should be able
to read a document/text and answer questions about it. Detailed information for each
implementation step is given in the following sections. Note that the lab exercises are a
good starting point for the assignment; the relevant lab exercises are specified in each section.
1. Data for Assignment 2 [Compulsory]
In this assignment, you are asked to use the Microsoft Research WikiQA Corpus. The
WikiQA corpus includes a set of question and sentence pairs, collected and
annotated for research on open-domain question answering. The questions were
derived from Bing query logs, and each question is linked to a Wikipedia page that
potentially has the answer. More details about this data can be found in the paper WikiQA:
A Challenge Dataset for Open-Domain Question Answering (Yang et al. 2015).
• Download datasets
You will be provided with two datasets: a training dataset and a testing
dataset. Both datasets contain the following attributes: QuestionID, Question,
DocumentID, DocumentTitle, SentenceID, Sentence, Label (the sentence is an answer
sentence if Label=1). If you want to explore or use the full dataset, you can download it via the
• Data Wrangling: You first need to wrangle the dataset. As shown in
Figure 1, each row corresponds to one sentence of a document. You
need to construct three different types of data for training the model: Question,
Document and Answer. To construct the document data, concatenate
(with a space) all sentences that have the same DocumentID. To
construct the answer data, use the sentence whose Label is 1.
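
The wrangling step described above can be sketched with pandas roughly as follows. This is only a minimal illustration, not a required implementation. The function name `wrangle` and the toy rows are made up for the example. It also makes three assumptions: sentences are de-duplicated per (DocumentID, SentenceID) in case several questions share a document; a question with no Label=1 sentence gets an empty answer string; and multiple answer sentences for one question are joined with a space. Adjust these choices to suit your own design.

```python
import pandas as pd


def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Build one row per question with Question, Document and Answer columns."""
    # Document: concatenate (with a space) all sentences sharing a DocumentID.
    # Assumption: drop repeated sentences if a document appears under
    # more than one question.
    sentences = df.drop_duplicates(subset=["DocumentID", "SentenceID"])
    documents = (
        sentences.groupby("DocumentID", sort=False)["Sentence"]
        .agg(" ".join)
        .rename("Document")
        .reset_index()
    )

    # Answer: the sentence(s) whose Label is 1, grouped per question.
    answers = (
        df[df["Label"] == 1]
        .groupby("QuestionID", sort=False)["Sentence"]
        .agg(" ".join)
        .rename("Answer")
        .reset_index()
    )

    # Question: one row per QuestionID.
    questions = df.drop_duplicates(subset="QuestionID")[
        ["QuestionID", "Question", "DocumentID"]
    ]

    out = questions.merge(documents, on="DocumentID", how="left")
    out = out.merge(answers, on="QuestionID", how="left")
    # Assumption: questions without an answer sentence get an empty string.
    out["Answer"] = out["Answer"].fillna("")
    return out


# Toy rows (invented for illustration) using the attribute names listed above.
toy = pd.DataFrame(
    [
        ("Q1", "toy question one?", "D1", "Toy 1", "D1-0", "First sentence.", 0),
        ("Q1", "toy question one?", "D1", "Toy 1", "D1-1", "Second sentence.", 1),
        ("Q2", "toy question two?", "D2", "Toy 2", "D2-0", "Only sentence.", 0),
    ],
    columns=[
        "QuestionID", "Question", "DocumentID", "DocumentTitle",
        "SentenceID", "Sentence", "Label",
    ],
)

result = wrangle(toy)
print(result[["Question", "Document", "Answer"]])
```

On the real data, you would load the provided training and testing files into a DataFrame first and pass each to the same function, so that both splits are wrangled identically.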

