Let the Machine Summarize the Meetings

Daksha singhal
6 min read · Feb 11, 2021

Forgot to take the minutes of the meeting? No problem.

Introduction

Imagine this: you are on a video call with your bosses or co-workers discussing the plan of action (POA) for the next financial quarter, and a summary of the conversation simply shows up on your desk. That’s exactly what this article is all about.

It deals with encapsulating meetings by summarizing them: spoken-style dialogue is recast in the style of written text, and the summary builds upon extractive summaries by generating new vocabulary words beyond the source text corpus (i.e. abstractive summarization).

Approach

Image By Author

Dataset

The dialogue dataset that was used for the project is the Switchboard Dialogue Act Corpus.

It is a corpus of 1,155 five-minute telephone conversations between two speakers, annotated with speech-act tags. In these conversations, callers question receivers on provided topics, such as child care, recycling, and news media. 440 speakers participate in these 1,155 conversations, producing 221,616 utterances (consecutive utterances by the same person are combined into one, leaving 122,646 utterances).
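As a rough illustration, the corpus can be read with Christopher Potts’s swda.py reader; the sketch below assumes that package (https://github.com/cgpotts/swda) is installed and the corpus CSVs are unpacked in ./swda, neither of which is confirmed by the project itself.

```python
# A minimal sketch, assuming Christopher Potts's swda reader
# (https://github.com/cgpotts/swda) with the corpus CSVs in ./swda.
from swda import CorpusReader

corpus = CorpusReader('swda')

merged = []  # (caller, text, act_tag) triples after merging
for utt in corpus.iter_utterances(display_progress=False):
    if merged and merged[-1][0] == utt.caller:
        # Combine consecutive utterances by the same speaker into one.
        merged[-1] = (utt.caller, merged[-1][1] + ' ' + utt.text, merged[-1][2])
    else:
        merged.append((utt.caller, utt.text, utt.damsl_act_tag()))
```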

Converting Raw Dialogues to Text

Life’s hard, and so is preprocessing raw data.

Image By Author

CRF Modeling

A Conditional Random Field (CRF) models the conditional probability, and hence the decision boundary, of the different classes based on the context of a particular dialogue act and the uttered sequence. Each pair of Dialogue Act (DA) labels has its own transition probability.

It is often challenging to guess the DA of an utterance without knowing the context of the preceding and following utterances. For example, it is difficult to say whether “Paris” is meant as an answer or just a location; for that, we need to know whether the previous utterance was a question or not.

CRF tags depend on the context of the whole dialogue: by allowing the features to depend on one another, the model learns to make much better predictions.

CRF labelling is used to remove redundant tags that do not contribute to the summary, such as non-verbal tags and acknowledgements, and also to identify utterances as questions and answers.
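A minimal sketch of such a tagger, assuming the sklearn-crfsuite package and that `dialogues` (a list of conversations, each a list of utterance strings) and `labels` (the corresponding DA tag sequences) are already loaded; the feature set and label names are illustrative, not the project’s actual ones.

```python
# A minimal sketch of a dialogue-act tagger with sklearn-crfsuite;
# the features and labels are illustrative.
import sklearn_crfsuite

def utterance_features(dialogue, i):
    """Features for the i-th utterance, including context from its neighbour."""
    utt = dialogue[i]
    feats = {
        'lower': utt.lower(),
        'is_question': utt.strip().endswith('?'),
        'first_word': utt.split()[0].lower() if utt.split() else '',
    }
    if i > 0:
        feats['prev_is_question'] = dialogue[i - 1].strip().endswith('?')
    return feats

# One training example per conversation: a feature sequence and a tag sequence.
X_train = [[utterance_features(d, i) for i in range(len(d))] for d in dialogues]
y_train = labels  # e.g. [['question', 'answer', 'acknowledgement', ...], ...]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
tags = crf.predict(X_train[:1])  # predicted DA tags for the first conversation
```

Utterances whose predicted tag is non-verbal or a bare acknowledgement can then simply be dropped before summarization.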

Anaphora Resolution

Hugging Face coreference system in operation. Try it for yourself!

As we are dealing with dialogue-format data, it is essential to identify references to other speakers in the dialogues.

To resolve this coreference problem, anaphora resolution can be applied to the dialogue-formatted data, resolving it into plain text based on the CRF-labelled tags. This is treated as a pronoun-resolution task, which becomes a real problem when there are more than two speakers.

To proceed with anaphora resolution, the following preprocessing steps are carried out (a minimal sketch follows the list):

  • Eliminating brackets and their contents
  • Removing punctuation and symbols
  • Expanding contractions
  • Concatenating each speaker’s name with their corresponding dialogue using the CRF labelling
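A minimal sketch of these steps, assuming the input is a list of (speaker, utterance) pairs kept by the CRF labelling; the contraction map is deliberately truncated.

```python
# A minimal sketch of the preprocessing steps; the contraction map is truncated
# and the (speaker, utterance) input format is an assumption.
import re

CONTRACTIONS = {"it's": "it is", "don't": "do not", "i'm": "i am"}  # extend as needed

def preprocess(utterance):
    utterance = re.sub(r'\([^)]*\)|\[[^\]]*\]', '', utterance)  # drop brackets and contents
    for short, full in CONTRACTIONS.items():                    # expand contractions
        utterance = re.sub(re.escape(short), full, utterance, flags=re.IGNORECASE)
    return re.sub(r'[^\w\s]', '', utterance).strip()            # strip punctuation and symbols

def dialogue_to_text(turns):
    """turns: list of (speaker, utterance) pairs kept by the CRF labelling."""
    return ' '.join(f'{speaker} said {preprocess(utt)}.' for speaker, utt in turns)
```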

The quality of the resolution depends on the spoken dialogue and on the context associated with each utterance. The CRF tag labelling is used to supply that context.
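To make the resolution step concrete, here is a minimal sketch using Hugging Face’s neuralcoref library (the demo linked above); note that neuralcoref runs on spaCy 2.x, and the example sentence is invented.

```python
# A minimal sketch using Hugging Face's neuralcoref (requires spaCy 2.x).
import spacy
import neuralcoref

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)  # register the coreference component

doc = nlp('A said the deadline moved. B said he would update the plan for A.')
print(doc._.coref_clusters)   # detected coreference clusters
print(doc._.coref_resolved)   # text with pronouns replaced by their antecedents
```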

Proposed Model: TRANSFORMERS

https://arxiv.org/abs/1706.03762

The Transformer keeps track of the positions of the inputs and outputs by using a position-wise embedding mechanism. “Attention Is All You Need” uses sine and cosine functions, where the position of the input determines the argument of the sinusoid and the embedding dimension determines its frequency (e.g. each of the 128 dimensions of a word vector corresponds to a sinusoid of a different frequency).
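Concretely, the paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch, where the sequence length and dimensionality are placeholders:

```python
# A minimal NumPy sketch of the sinusoidal positional encoding from the paper.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]   # (max_len, 1) token positions
    i = np.arange(d_model)[None, :]     # (1, d_model) dimension indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=128)  # added to the input embeddings
```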

So the Transformer is basically an encoder-decoder with positional embeddings of the inputs and fully connected neural networks, and it learns the alignments between input and output sequences. The Transformer is simply a step above its previous generations, substituting a new structure for the RNN layers.

Attention

Attention layers are the core building blocks of the Transformer. They update the embedding of each token by taking the rest of the input tokens into account, according to the formula:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

(formula from https://arxiv.org/abs/1706.03762)

Multi-head attention consists of:

  • Linear layers, split into heads.
  • Scaled dot-product attention.
  • Concatenation of heads.
  • A final linear layer.

The query (Q), keys (K), values (V), and output are all vectors. The output is calculated as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Scaled dot-product attention is applied to every head, and masking is used in each attention step. The attention outputs of the heads are then concatenated and passed through a final dense layer.

Instead of using a single attention head, the query, key, and value vectors are divided into multiple heads, created by multiplying the embeddings with three weight matrices learned during training. This allows the model to jointly attend to information at different positions from different representation subspaces. Because each head works at reduced dimensionality after the split, the total computational cost is similar to that of single-head attention with full dimensionality.
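A minimal NumPy sketch of scaled dot-product attention and the head split described above; the toy dimensions (4 tokens, model size 8, 2 heads) are placeholders, and the final linear layer is omitted.

```python
# A minimal NumPy sketch of (multi-head) scaled dot-product attention.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # query-key compatibility
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # masked positions get ~zero weight
    return softmax(scores) @ V                      # weighted sum of the values

# Toy example: 4 tokens, model dimension 8, split into 2 heads of size 4.
x = np.random.randn(4, 8)
W_q, W_k, W_v = (np.random.randn(8, 8) for _ in range(3))  # learned during training
Q, K, V = (m.reshape(4, 2, 4).swapaxes(0, 1) for m in (x @ W_q, x @ W_k, x @ W_v))
heads = scaled_dot_product_attention(Q, K, V)  # (2, 4, 4): one output per head
out = heads.swapaxes(0, 1).reshape(4, 8)       # concatenated heads; final linear omitted
```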

Setup

The Adam optimizer is used with a custom learning-rate schedule: the learning rate increases linearly for the first warmup_steps training steps and decreases thereafter. Training runs for a maximum of 400 epochs with an early-stopping strategy. The size of the hidden vectors is set to 512. The model is trained on 750 examples and validated on 40 examples. The loss metric is cross-entropy, used to predict one word at a time at the decoder and append it to the output.
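This is the schedule from “Attention Is All You Need”; a minimal sketch (warmup_steps = 4000 is the paper’s default, not necessarily the value used in this project):

```python
# A minimal sketch of the warmup learning-rate schedule from the paper.
def learning_rate(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps, then inverse-sqrt decay.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```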

Result

ROUGE scores are used as the standard metric for the evaluation and inference of the summarizer. ROUGE works by comparing an automatically produced summary or translation against a set of human-generated reference summaries.
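As an illustration, ROUGE can be computed with Google’s rouge-score package; the reference and candidate strings below are invented.

```python
# A minimal sketch using the rouge-score package; the texts are invented.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference = 'the speakers discussed child care options'
candidate = 'the speakers talked about child care'
print(scorer.score(reference, candidate))  # precision/recall/F1 per ROUGE variant
```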

With the LSTM-RNN, the summarized text had a lot of repeated words, which gave it a lower ROUGE score than the Transformer.

With the Transformer model, the summarized text had relatively little newly generated vocabulary, owing to the smaller amount of training data.

References

[1] Attention Is All You Need: Ashish Vaswani et al., 2017. https://arxiv.org/abs/1706.03762

[2] Prakhar Ganesh and Saket Dingliwal. 2019. Abstractive summarization of spoken and written conversation...

[3] Get to the Point: Summarization with Pointer-Generator Networks: Abigail See, Peter J. Liu, and Christopher D. Manning

[4] Just News It: Abstractive Text Summarization with a Pointer-Generator Transformer: Vrinda Vasavada and Alexandre Bucquet

[5] Speaker-change Aware CRF for Dialogue Act Classification: Guokan Shang, Antoine J.-P. Tixier, Michalis Vazirgiannis, and Jean-Pierre Lorré

[6] Automatic Dialogue Summary Generation for Customer Service: Chunyi Liu, Peng Wang, Jiang Xu, Zang Li, and Jieping Ye

[7] Automatic Chinese Dialogue Text Summarization Based on LSA and Segmentation: Chuanhan Liu, Yongcheng Wang, Fei Zheng, and Derong Liu
