Let the Machine Summarize the Meetings

Daksha singhal
6 min read · Feb 11, 2021

Forgot to take the minutes of the meeting? No problem.

Introduction

Imagine this: you are on a video call with your bosses or co-workers discussing the plan of action (POA) for the next financial quarter, and a summary of the conversation simply shows up on your desk. That’s exactly what this article is all about.

It deals with encapsulating meetings by summarizing them: spoken-style dialogue is recast in the style of written text, and the summary builds upon extractive summaries by generating new vocabulary words beyond the source text corpus (i.e. abstractive summarization).

Approach

Image By Author

Dataset

The dialogue dataset that was used for the project is the Switchboard Dialogue Act Corpus.

It is a corpus of 1,155 five-minute telephone conversations between two speakers, annotated with speech-act tags. In these conversations, callers question receivers on provided topics, such as child care, recycling, and news media. 440 speakers participate in these 1,155 conversations, producing 221,616 utterances (consecutive utterances by the same person are combined into one, leaving 122,646 utterances).
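As a rough illustration, the corpus can be read with Christopher Potts’s swda.py reader; the sketch below assumes that package (https://github.com/cgpotts/swda) is installed and the corpus CSVs are unpacked in ./swda, neither of which is confirmed by the project itself.

```python
# A minimal sketch, assuming Christopher Potts's swda reader
# (https://github.com/cgpotts/swda) with the corpus CSVs in ./swda.
from swda import CorpusReader

corpus = CorpusReader('swda')

merged = []  # (caller, text, act_tag) triples after merging
for utt in corpus.iter_utterances(display_progress=False):
    if merged and merged[-1][0] == utt.caller:
        # Combine consecutive utterances by the same speaker into one.
        merged[-1] = (utt.caller, merged[-1][1] + ' ' + utt.text, merged[-1][2])
    else:
        merged.append((utt.caller, utt.text, utt.damsl_act_tag()))
```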

Converting Raw Dialogues to Text

Life’s hard, and so is preprocessing raw data.

Image By Author

CRF Modeling

A Conditional Random Field (CRF) models the conditional probability, and hence the decision boundary, of the different classes based on the context of a particular dialogue act and the uttered sequence. Each pair of Dialogue Act (DA) labels has its own transition probability.

It is often challenging to guess the DA of an utterance without knowing the context of the preceding and following utterances. For example, it is difficult to say whether “Paris” is meant as an answer or just a location; for that, we need to know whether the previous utterance was a question or not.

CRF tags depend on the context of the whole dialogue: by allowing the features to depend on one another, the model learns to make much better predictions.

CRF labelling is used to remove redundant tags that do not contribute to the summary, such as non-verbal tags and acknowledgements, and also to identify utterances as questions and answers.
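A minimal sketch of such a tagger, assuming the sklearn-crfsuite package and that `dialogues` (a list of conversations, each a list of utterance strings) and `labels` (the corresponding DA tag sequences) are already loaded; the feature set and label names are illustrative, not the project’s actual ones.

```python
# A minimal sketch of a dialogue-act tagger with sklearn-crfsuite;
# the features and labels are illustrative.
import sklearn_crfsuite

def utterance_features(dialogue, i):
    """Features for the i-th utterance, including context from its neighbour."""
    utt = dialogue[i]
    feats = {
        'lower': utt.lower(),
        'is_question': utt.strip().endswith('?'),
        'first_word': utt.split()[0].lower() if utt.split() else '',
    }
    if i > 0:
        feats['prev_is_question'] = dialogue[i - 1].strip().endswith('?')
    return feats

# One training example per conversation: a feature sequence and a tag sequence.
X_train = [[utterance_features(d, i) for i in range(len(d))] for d in dialogues]
y_train = labels  # e.g. [['question', 'answer', 'acknowledgement', ...], ...]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
tags = crf.predict(X_train[:1])  # predicted DA tags for the first conversation
```

Utterances whose predicted tag is non-verbal or a bare acknowledgement can then simply be dropped before summarization.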

Anaphora Resolution

Hugging Face coreference system in operation. Try it for yourself!

As we are dealing with dialogue-format data, it is essential to identify references to other speakers in the dialogues.

To resolve this coreference problem, anaphora resolution can be applied to the dialogue-formatted data, resolving it into plain text based on the CRF-labelled tags. This is treated as a pronoun-resolution task, which becomes a real problem when there are more than two speakers.

To proceed with anaphora resolution, the following preprocessing steps are carried out (a minimal sketch follows the list):

  • Eliminating brackets and their contents
  • Removing punctuation and symbols
  • Expanding contractions
  • Concatenating each speaker’s name with their corresponding dialogue using the CRF labelling
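A minimal sketch of these steps, assuming the input is a list of (speaker, utterance) pairs kept by the CRF labelling; the contraction map is deliberately truncated.

```python
# A minimal sketch of the preprocessing steps; the contraction map is truncated
# and the (speaker, utterance) input format is an assumption.
import re

CONTRACTIONS = {"it's": "it is", "don't": "do not", "i'm": "i am"}  # extend as needed

def preprocess(utterance):
    utterance = re.sub(r'\([^)]*\)|\[[^\]]*\]', '', utterance)  # drop brackets and contents
    for short, full in CONTRACTIONS.items():                    # expand contractions
        utterance = re.sub(re.escape(short), full, utterance, flags=re.IGNORECASE)
    return re.sub(r'[^\w\s]', '', utterance).strip()            # strip punctuation and symbols

def dialogue_to_text(turns):
    """turns: list of (speaker, utterance) pairs kept by the CRF labelling."""
    return ' '.join(f'{speaker} said {preprocess(utt)}.' for speaker, utt in turns)
```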

The quality of the resolution depends on the spoken dialogue and on the context associated with each utterance. The CRF tag labelling is used to supply that context.
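To make the resolution step concrete, here is a minimal sketch using Hugging Face’s neuralcoref library (the demo linked above); note that neuralcoref runs on spaCy 2.x, and the example sentence is invented.

```python
# A minimal sketch using Hugging Face's neuralcoref (requires spaCy 2.x).
import spacy
import neuralcoref

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)  # register the coreference component

doc = nlp('A said the deadline moved. B said he would update the plan for A.')
print(doc._.coref_clusters)   # detected coreference clusters
print(doc._.coref_resolved)   # text with pronouns replaced by their antecedents
```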

Proposed Model: TRANSFORMERS

https://arxiv.org/abs/1706.03762

The Transformer keeps track of the positions of the inputs and outputs by using a position-wise embedding mechanism. “Attention Is All You Need” uses sine and cosine functions, where the position of the input determines the argument of the sinusoid and the embedding dimension determines its frequency (e.g. each of the 128 dimensions of a word vector corresponds to a sinusoid of a different frequency).
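Concretely, the paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch, where the sequence length and dimensionality are placeholders:

```python
# A minimal NumPy sketch of the sinusoidal positional encoding from the paper.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]   # (max_len, 1) token positions
    i = np.arange(d_model)[None, :]     # (1, d_model) dimension indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=128)  # added to the input embeddings
```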

So the Transformer is basically an encoder-decoder with positional embeddings of the inputs and fully connected neural networks, and it learns the alignments between input and output sequences. The Transformer is simply a step above its previous generations, substituting a new structure for the RNN layers.

Attention

Attention layers are the core building blocks of the Transformer. They update the embedding of each token by taking the rest of the input tokens into account, according to the formula:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

(formula from https://arxiv.org/abs/1706.03762)

Multi-head attention consists of:

  • Linear layers, split into heads.
  • Scaled dot-product attention.
  • Concatenation of heads.
  • A final linear layer.

The query (Q), keys (K), values (V), and output are all vectors. The output is calculated as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Scaled dot-product attention is applied to every head, and masking is used in each attention step. The attention outputs of the heads are then concatenated and passed through a final dense layer.

Instead of using a single attention head, the query, key, and value vectors are divided into multiple heads, created by multiplying the embeddings with three weight matrices learned during training. This allows the model to jointly attend to information at different positions from different representation subspaces. Because each head works at reduced dimensionality after the split, the total computational cost is similar to that of single-head attention with full dimensionality.
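A minimal NumPy sketch of scaled dot-product attention and the head split described above; the toy dimensions (4 tokens, model size 8, 2 heads) are placeholders, and the final linear layer is omitted.

```python
# A minimal NumPy sketch of (multi-head) scaled dot-product attention.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # query-key compatibility
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # masked positions get ~zero weight
    return softmax(scores) @ V                      # weighted sum of the values

# Toy example: 4 tokens, model dimension 8, split into 2 heads of size 4.
x = np.random.randn(4, 8)
W_q, W_k, W_v = (np.random.randn(8, 8) for _ in range(3))  # learned during training
Q, K, V = (m.reshape(4, 2, 4).swapaxes(0, 1) for m in (x @ W_q, x @ W_k, x @ W_v))
heads = scaled_dot_product_attention(Q, K, V)  # (2, 4, 4): one output per head
out = heads.swapaxes(0, 1).reshape(4, 8)       # concatenated heads; final linear omitted
```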

Setup

The Adam optimizer is used with a custom learning-rate schedule: the learning rate increases linearly for the first warmup_steps training steps and decreases thereafter. Training runs for a maximum of 400 epochs with an early-stopping strategy. The size of the hidden vectors is set to 512. The model is trained on 750 examples and validated on 40 examples. The loss metric is cross-entropy, used to predict one word at a time at the decoder and append it to the output.
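This is the schedule from “Attention Is All You Need”; a minimal sketch (warmup_steps = 4000 is the paper’s default, not necessarily the value used in this project):

```python
# A minimal sketch of the warmup learning-rate schedule from the paper.
def learning_rate(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps, then inverse-sqrt decay.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```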

Result

ROUGE scores are used as the standard metric for the evaluation and inference of the summarizer. ROUGE works by comparing an automatically produced summary or translation against a set of human-generated reference summaries.
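As an illustration, ROUGE can be computed with Google’s rouge-score package; the reference and candidate strings below are invented.

```python
# A minimal sketch using the rouge-score package; the texts are invented.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference = 'the speakers discussed child care options'
candidate = 'the speakers talked about child care'
print(scorer.score(reference, candidate))  # precision/recall/F1 per ROUGE variant
```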

With the LSTM-RNN, the summarized text had a lot of repeated words, which gave it a lower ROUGE score than the Transformer.

With the Transformer model, the summarized text had relatively little newly generated vocabulary, owing to the smaller amount of training data.

References

[1] Attention Is All You Need: Ashish Vaswani et al., 2017. https://arxiv.org/abs/1706.03762

[2] Prakhar Ganesh and Saket Dingliwal. 2019. Abstractive summarization of spoken and written conversation...

[3] Get to the Point: Summarization with Pointer-Generator Networks: Abigail See, Peter J. Liu, and Christopher D. Manning

[4] Just News It: Abstractive Text Summarization with a Pointer-Generator Transformer: Vrinda Vasavada and Alexandre Bucquet

[5] Speaker-change Aware CRF for Dialogue Act Classification: Guokan Shang, Antoine J.-P. Tixier, Michalis Vazirgiannis, and Jean-Pierre Lorré

[6] Automatic Dialogue Summary Generation for Customer Service: Chunyi Liu, Peng Wang, Jiang Xu, Zang Li, and Jieping Ye

[7] Automatic Chinese Dialogue Text Summarization Based on LSA and Segmentation: Chuanhan Liu, Yongcheng Wang, Fei Zheng, and Derong Liu
