Understanding Evaluation Metrics for NLP Tasks

GARIMA MAHATO
16 min read · Jan 14, 2022
"If the metrics you are looking at aren't useful in optimizing your strategy - stop looking at them."- Mark Twain

In this article, I will discuss some of the evaluation metrics used in deep learning models for various NLP-related tasks. We will look at how each metric works with an example and how to code it up. Let's start.

Please note the commonly used terms and their meanings in this article:

  1. Reference translation/sentence/word/token - the human translation/sentence/word/token
  2. Candidate translation/sentence/word/token - the machine/model translation/sentence/word/token

A) For classification related language tasks (like identifying review sentiment)

1. Recall, Precision, F1 Score

Fig 1: Definition of precision, recall, accuracy
Fig 2: Example of precision, recall, accuracy

Precision (P): the ratio of sentences correctly predicted as a class to the total sentences predicted as that class.

Recall (R): the ratio of sentences correctly predicted as a class to the total sentences actually belonging to that class.

F1 score: the harmonic mean of precision and recall, 2PR/(P + R).
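As a quick numeric sanity check of these definitions, here is a tiny example. It uses scikit-learn, which is not needed anywhere else in this article; it only cross-checks the formulas.

# A minimal numeric check of the definitions above (assumes scikit-learn is installed).
from sklearn.metrics import precision_recall_fscore_support

y_true = ["pos", "pos", "neg", "neg", "neg", "neu"]   # actual classes
y_pred = ["pos", "neg", "neg", "neg", "pos", "neu"]   # predicted classes

# For class "pos": predicted "pos" 2 times, 1 of them correct -> P = 1/2
#                  2 sentences are actually "pos", 1 of them recovered -> R = 1/2
#                  F1 = 2*P*R/(P+R) = 0.5
p, r, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["pos", "neg", "neu"], zero_division=0
)
print(dict(zip(["pos", "neg", "neu"], zip(p, r, f1))))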

import numpy as np


class precision_recall_f1score:
    def __init__(self, labels=None):
        self.labels = labels
        # per-class running counts of correct predictions, predictions and targets
        self.classwise_correct_pred = {c: 0 for c in self.labels}
        self.classwise_pred = {c: 0 for c in self.labels}
        self.classwise_target = {c: 0 for c in self.labels}
        self.classwise_prec_recall_f1 = {c: {'precision': 0, 'recall': 0, 'f1': 0} for c in self.labels}
        self.global_tp = 0
        self.global_fp = 0
        self.global_fn = 0
        (self.micro_avg_f1, self.micro_prec, self.micro_recall,
         self.macro_f1, self.macro_precision, self.macro_recall,
         self.wgt_f1, self.wgt_precision, self.wgt_recall) = 0, 0, 0, 0, 0, 0, 0, 0, 0

    def update(self, preds, targets):
        # preds: (batch, num_classes) scores/logits, targets: (batch,) class indices
        top_pred = preds.argmax(1, keepdim=True)
        self.global_tp += top_pred.eq(targets.view_as(top_pred)).sum().item()
        for c in self.classwise_correct_pred.keys():
            self.classwise_correct_pred[c] += (top_pred[top_pred.eq(targets.view_as(top_pred))] == c).sum().float()
            self.classwise_pred[c] += (top_pred == c).sum().float()
            self.classwise_target[c] += (targets == c).sum().float()

    def calculate(self):
        for k in self.labels:
            # class-wise precision, recall and F1 (harmonic mean, 2PR/(P+R))
            prec = (self.classwise_correct_pred[k] / (self.classwise_pred[k] + 1e-20)).item()
            rec = (self.classwise_correct_pred[k] / (self.classwise_target[k] + 1e-20)).item()
            self.classwise_prec_recall_f1[k]['precision'] = prec
            self.classwise_prec_recall_f1[k]['recall'] = rec
            self.classwise_prec_recall_f1[k]['f1'] = 2 * prec * rec / (prec + rec + 1e-20)
            self.global_fp += self.classwise_pred[k] - self.classwise_correct_pred[k]
            self.global_fn += self.classwise_target[k] - self.classwise_correct_pred[k]
        # micro averages: computed from the pooled TP/FP/FN counts
        self.micro_prec = self.global_tp / (self.global_tp + self.global_fp.item())
        self.micro_recall = self.global_tp / (self.global_tp + self.global_fn.item())
        self.micro_avg_f1 = 2 * self.micro_prec * self.micro_recall / (self.micro_prec + self.micro_recall + 1e-20)
        # macro averages: unweighted mean over classes
        self.macro_precision = np.average([self.classwise_prec_recall_f1[k]['precision'] for k in self.labels])
        self.macro_recall = np.average([self.classwise_prec_recall_f1[k]['recall'] for k in self.labels])
        self.macro_f1 = np.average([self.classwise_prec_recall_f1[k]['f1'] for k in self.labels])
        # weighted averages: mean over classes weighted by class support
        weights = [v.item() for v in self.classwise_target.values()]
        self.wgt_precision = np.average([self.classwise_prec_recall_f1[k]['precision'] for k in self.labels], weights=weights)
        self.wgt_recall = np.average([self.classwise_prec_recall_f1[k]['recall'] for k in self.labels], weights=weights)
        self.wgt_f1 = np.average([self.classwise_prec_recall_f1[k]['f1'] for k in self.labels], weights=weights)
        return (self.classwise_prec_recall_f1,
                self.micro_avg_f1, self.micro_prec, self.micro_recall,
                self.macro_f1, self.macro_precision, self.macro_recall,
                self.wgt_f1, self.wgt_precision, self.wgt_recall)
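A minimal usage sketch of the class above. The tensors are random stand-ins, assuming preds are raw model outputs of shape (batch, num_classes) and targets are integer class indices:

import torch

metric = precision_recall_f1score(labels=[0, 1, 2])
for _ in range(10):                      # pretend these are validation batches
    preds = torch.randn(32, 3)           # logits from the model
    targets = torch.randint(0, 3, (32,)) # ground-truth class indices
    metric.update(preds, targets)

(classwise, micro_f1, micro_p, micro_r,
 macro_f1, macro_p, macro_r,
 wgt_f1, wgt_p, wgt_r) = metric.calculate()
print(classwise[0], micro_f1, macro_f1, wgt_f1)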

Training Logs of Precision, Recall, F1 Score with explanation

Epoch: 08 | Epoch Time: 0m 1s
Train Loss: 0.512 | Train Acc: 80.45%
Val. Loss: 2.123
Val Metric: Precision, Recall, F1
======================================
Class | Precision | Recall | F1
======================================
positive | 0.3987538814544678 | 0.436860054731369 | 0.20846904884291376
negative | 0.43919509649276733 | 0.5516483783721924 | 0.24452021829408901
neutral | 0.2540540397167206 | 0.2326732724905014 | 0.1214470265542746
very positive | 0.49799197912216187 | 0.4542124569416046 | 0.23754789602675389
very negative | 0.40528634190559387 | 0.2067415714263916 | 0.13690476100518295
======================================
Micro Average F1 Score: 0.2018606024808033
Macro Average F1 Score: 0.18977779014464283
Weighted Average F1 Score: 0.19786720630547747
Epoch: 09 | Epoch Time: 0m 1s
Train Loss: 0.367 | Train Acc: 86.41%
Val. Loss: 2.589
Val Metric: Precision, Recall, F1
======================================
Class | Precision | Recall | F1
======================================
positive | 0.42064371705055237 | 0.4311717748641968 | 0.21292134245934963
negative | 0.43939393758773804 | 0.38241758942604065 | 0.20446533651249416
neutral | 0.2233918160200119 | 0.31518152356147766 | 0.13073237709661886
very positive | 0.49900200963020325 | 0.45787546038627625 | 0.23877746320975812
very negative | 0.36201781034469604 | 0.27415731549263 | 0.1560102352539963
======================================
Micro Average F1 Score: 0.1904902539870053
Macro Average F1 Score: 0.1885813509064434
Weighted Average F1 Score: 0.19262911587987164
Epoch: 10 | Epoch Time: 0m 1s
Train Loss: 0.270 | Train Acc: 89.91%
Val. Loss: 3.245
Val Metric: Precision, Recall, F1
======================================
Class | Precision | Recall | F1
======================================
positive | 0.42147117853164673 | 0.48236632347106934 | 0.22493368817608342
negative | 0.4451901614665985 | 0.4373626410961151 | 0.22062084471733523
neutral | 0.24074074625968933 | 0.3217821717262268 | 0.13771186502373733
very positive | 0.5513513684272766 | 0.37362638115882874 | 0.22270742904316226
very negative | 0.38562092185020447 | 0.26516854763031006 | 0.15712383893443077
======================================
Micro Average F1 Score: 0.19772593030124036
Macro Average F1 Score: 0.1926195331789498
Weighted Average F1 Score: 0.1988935131090743

From these logs we can track the class-wise precision, recall and F1 along with the micro, macro and weighted average F1 across epochs. A rising F1 for a class means the model is predicting that class more accurately and recalling more of its examples; here the scores fluctuate slightly from epoch to epoch while the validation loss rises, so these metrics also help us spot when the model stops improving on validation data even though training accuracy keeps climbing.

B) For non-classification related language tasks (like sentence translation)

Please note: English-to-German translation on the Multi30k dataset is used as the running example here.

1. BLEU Score

Description

BLEU (Bilingual Evaluation Understudy) is a string-matching metric: it compares the n-grams of a candidate translation against the n-grams of one or more reference translations.

Input:

Target/Reference Sentence (x): A man sleeping in a green room on a couch.

Predicted/Candidate Sentence (x_hat): A man is sleeping on a green room on a couch .

Procedure

To calculate BLEU Score,

  1. Calculate N-grams for N = 1 to 4 (max_n = 4 in my code):

An n-gram is a sequence of words occurring within a given window, where n is the window size. For both the reference and the candidate sentence, 1-grams, 2-grams, 3-grams and 4-grams are generated.

N-gram for reference sentence:

Fig 3: N-gram(from n=1 to 4) for reference sentence used in the example

N-gram for candidate sentence:

Fig 4: N-gram(from n=1 to 4) for candidate sentence used in the example

2. Calculate Modified N-gram Precision

i) Calculate clipped counts: for each n-gram size, i.e., N = 1 to 4,

a) Count the number of times each candidate n-gram occurs in the candidate sentence; this is referred to as Count.

b) For each reference sentence, count the number of times that n-gram occurs. If there are multiple reference translations (e.g., three), we calculate Ref 1 count, Ref 2 count, and Ref 3 count.

c) Take the maximum number of occurrences of the n-gram in any single reference. This is known as Max Ref Count.

d) Take the minimum of Count and Max Ref Count. This is known as Count clip, as it clips the total count of each candidate n-gram by its maximum reference count.

e) Add up all these clipped counts.

ii) Calculate precision: divide the sum of clipped counts by the total (unclipped) number of candidate n-grams to get the modified precision score pₙ.

3. Calculate Brevity Penalty:

The brevity penalty penalizes translations that are too short. It is calculated as

BP = 1 if c > r, and BP = e^(1 - r/c) if c ≤ r

where r is the count of words in the reference sentence (the effective reference length) and c is the count of words in the candidate sentence. BP is 1 when the candidate is at least as long as the reference.

4. Calculate BLEU Score

BLEU is calculated using the formula:

BLEU = BP * exp( Σₙ wₙ log pₙ ), summed over n = 1 to N

where BP is the brevity penalty,

N: number of n-gram sizes used; we usually use unigrams, bigrams, 3-grams and 4-grams, so N = 4,

wₙ: weight for each modified precision; by default, with N = 4, wₙ = 1/4 = 0.25,

pₙ: modified precision for n-grams of size n.

The BLEU metric ranges from 0 to 1. The BLEU score is 1 only when the candidate sentence is identical to one of the reference sentences.
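To make the clipped-count, modified-precision and brevity-penalty steps concrete, here is a minimal stand-alone sketch of the procedure above. It is not the torchtext implementation shown in the next section; the helper names are my own, and the effective reference length is simplified to the shortest reference.

import math
from collections import Counter

def ngrams(tokens, n):
    # all contiguous n-grams of the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, references, max_n=4):
    weights = [1.0 / max_n] * max_n
    log_pn_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))            # Count
        max_ref_counts = Counter()
        for ref in references:                                  # Ref 1, Ref 2, ... counts
            for ng, c in Counter(ngrams(ref, n)).items():
                max_ref_counts[ng] = max(max_ref_counts[ng], c) # Max Ref Count
        clipped = sum(min(c, max_ref_counts[ng]) for ng, c in cand_counts.items())  # Count clip
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0                                          # any zero modified precision -> BLEU 0
        log_pn_sum += weights[n - 1] * math.log(clipped / total)  # w_n * log p_n
    c = len(candidate)
    r = min(len(ref) for ref in references)   # shortest reference as a simple effective reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)                  # brevity penalty
    return bp * math.exp(log_pn_sum)

ref = "A man sleeping in a green room on a couch .".split()
cand = "A man is sleeping on a green room on a couch .".split()
print(simple_bleu(cand, [ref]))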

Code

Base Code

# imports needed if this excerpt is run outside torchtext
import math

import torch
from torchtext.data.metrics import _compute_ngram_counter


def bleu_score(candidate_corpus, references_corpus, max_n=4, weights=[0.25] * 4):
    """Computes the BLEU score between a candidate translation corpus and a references
    translation corpus. Based on https://www.aclweb.org/anthology/P02-1040.pdf
    Arguments:
        candidate_corpus: an iterable of candidate translations. Each translation is an
            iterable of tokens
        references_corpus: an iterable of iterables of reference translations. Each
            translation is an iterable of tokens
        max_n: the maximum n-gram we want to use. E.g. if max_n=3, we will use unigrams,
            bigrams and trigrams
        weights: a list of weights used for each n-gram category (uniform by default)
    Examples:
        >>> from torchtext.data.metrics import bleu_score
        >>> candidate_corpus = [['My', 'full', 'pytorch', 'test'], ['Another', 'Sentence']]
        >>> references_corpus = [[['My', 'full', 'pytorch', 'test'], ['Completely', 'Different']], [['No', 'Match']]]
        >>> bleu_score(candidate_corpus, references_corpus)
        0.8408964276313782
    """
    assert max_n == len(weights), 'Length of the "weights" list has to be equal to max_n'
    assert len(candidate_corpus) == len(references_corpus), \
        'The length of candidate and reference corpus should be the same'

    clipped_counts = torch.zeros(max_n)
    total_counts = torch.zeros(max_n)
    weights = torch.tensor(weights)

    candidate_len = 0.0
    refs_len = 0.0

    for (candidate, refs) in zip(candidate_corpus, references_corpus):
        candidate_len += len(candidate)

        # Get the length of the reference that's closest in length to the candidate
        refs_len_list = [float(len(ref)) for ref in refs]
        refs_len += min(refs_len_list, key=lambda x: abs(len(candidate) - x))

        reference_counters = _compute_ngram_counter(refs[0], max_n)
        for ref in refs[1:]:
            reference_counters = reference_counters | _compute_ngram_counter(ref, max_n)

        candidate_counter = _compute_ngram_counter(candidate, max_n)

        # Clip each candidate n-gram count by its maximum count over all references
        clipped_counter = candidate_counter & reference_counters

        for ngram in clipped_counter:
            clipped_counts[len(ngram) - 1] += clipped_counter[ngram]

        for ngram in candidate_counter:  # TODO: no need to loop through the whole counter
            total_counts[len(ngram) - 1] += candidate_counter[ngram]

    if min(clipped_counts) == 0:
        return 0.0
    else:
        pn = clipped_counts / total_counts
        log_pn = weights * torch.log(pn)
        score = torch.exp(sum(log_pn))

        bp = math.exp(min(1 - refs_len / candidate_len, 0))

        return bp * score.item()

Implementation

from torchtext.data.metrics import bleu_score


def calculate_bleu(model, max_n=4):
    trgs = []
    pred_trgs = []
    data = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

    for datum in data:
        src, trg = datum

        pred_trg = translate_sentence(model, src)

        # cut off <eos> token
        pred_trg = pred_trg[:-1]

        pred_trgs.append(pred_trg.strip().split(' '))
        trgs.append([token_transform[TGT_LANGUAGE](trg.strip())])

    return bleu_score(pred_trgs, trgs, max_n=max_n)

2. Perplexity

Description

Perplexity literally means an agitated or entangled state. The agitation, or randomness, of a system is measured by its entropy, and entropy is the average number of bits needed to encode the information contained in a random variable. To represent 16 equally likely numbers in binary, we need log2(16) = 4 bits, so the entropy is 4 bits. Perplexity is the exponentiation of entropy: 2^entropy = 2^4 = 16. So the exponentiation of the entropy gives the effective number of equally likely outcomes, or the weighted average number of choices a random variable has.

Entropy is defined as H(p) = -Σ p(x) log p(x)

Perplexity is P = e^(H(p)) = e^(-Σ p(x) log p(x)) when the log is natural (as in cross-entropy loss); with base-2 logs, the equivalent form is P = 2^(H(p)). Both give the same value.
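To make the connection between the 16-number example and the formulas concrete, here is a small numeric check (NumPy only; nothing else from this article is needed):

import numpy as np

p = np.full(16, 1 / 16)                    # uniform distribution over 16 outcomes
entropy_bits = -(p * np.log2(p)).sum()     # 4.0 bits
entropy_nats = -(p * np.log(p)).sum()      # ~2.77 nats
print(2 ** entropy_bits, np.exp(entropy_nats))   # both ≈ 16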

Input:

Target/Reference Sentence (x): A man sleeping in a green room on a couch.

Predicted/Candidate Sentence (x_hat): A man is sleeping on a green room on a couch .

Procedure

To calculate Perplexity,

  1. Calculate Categorical Cross-entropy as C
  2. Calculate Perplexity as np.exp(C)

Code

model.eval()
losses = 0

val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

for src, tgt in val_dataloader:
    src = src.to(DEVICE)
    tgt = tgt.to(DEVICE)

    tgt_input = tgt[:-1, :]
    src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
    logits = model(src, tgt_input, src_mask, tgt_mask,
                   src_padding_mask, tgt_padding_mask, src_padding_mask)

    tgt_out = tgt[1:, :]
    loss = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
    losses += loss.item()

# average cross-entropy over the validation set, then exponentiate to get perplexity
val_loss = losses / len(val_dataloader)
perplexity = np.exp(val_loss)

3. BERT Score

Description

BERTScore compares candidate and reference sentences using contextual embeddings from a pre-trained BERT model rather than exact string matches, and reports precision, recall and F1 computed from token-level cosine similarities.

Input:

Target/Reference Sentence (x): A man sleeping in a green room on a couch.

Predicted/Candidate Sentence (x_hat): A man is sleeping on a green room on a couch .

Procedure

To calculate BERT Score,

  1. Convert Target/Reference Sentence and Predicted/Candidate Sentence to Contextual Embeddings:

Both the sentences are tokenized.

Target/Reference Tokens: [‘A’, ‘man’, ‘sleeping’, ‘in’, ‘a’, ‘green’, ‘room’, ‘on’, ‘a’, ‘couch’, ‘.’]

Predicted/Candidate Tokens: [‘A’, ‘man’, ‘is’, ‘sleeping’, ‘on’, ‘a’, ‘green’, ‘room’, ‘on’, ‘a’, ‘couch’, ‘.’]

Both token sequences are then passed through BERT (or a similar embedding model such as ELMo) to obtain contextual embeddings. Here the BERT model performs both tokenization and contextual embedding generation: BERT tokenizes the input text into a sequence of word pieces, where unknown words are split into several commonly observed sequences of characters, and the representation of each word piece is computed with a Transformer encoder by repeatedly applying self-attention and nonlinear transformations in an alternating fashion. The output of the BERT model is one embedding matrix per sentence: a collection of vectors, where each vector is the contextual representation of one token in the sentence.

Target/Reference Contextual Embeddings: X

Predicted/Candidate Contextual Embeddings: X_hat

2. Calculate Pairwise Cosine Similarity among the Contextual Embedding Vectors:

Each vector in the embeddings is denoted as below.

Target/Reference Token Vector: Xi

Predicted/Candidate Token Vector: X_hatj

For each vector Xi of the reference tokens, each vector X_hatj of the candidate tokens is considered one by one, and for each pair (Xi, X_hatj) the cosine similarity is calculated as

Cosine similarity = (Xiᵀ · X_hatj) / (||Xi|| · ||X_hatj||)

Since the vectors Xi and X_hatj are normalized, the calculation of cosine similarity reduces to the inner product Xiᵀ · X_hatj.

3. Calculate BERT Score:

i) Calculate Precision:

For each token vector X_hatj in the candidate tokens, the maximum cosine similarity of X_hatj with all the reference token vectors Xi is taken as the similarity score for that X_hatj. To calculate precision for the entire sentence:

a) With Importance Weighting: Previous work on similarity measures has shown that rare words can be more indicative of sentence similarity than common words. Inverse document frequency (IDF) is used to measure the importance of words, with IDF scores computed on a reference corpus. The precision for the sentence is then the IDF-weighted average of the similarity scores of all candidate tokens X_hatj.

b) Without Importance Weighting: The precision for the sentence is the plain average of the similarity scores of all candidate tokens X_hatj.

ii) Calculate Recall:

For each token vector Xi in the reference tokens, the maximum cosine similarity of Xi with all the candidate token vectors X_hatj is taken as the similarity score for that Xi. To calculate recall for the entire sentence:

a) With Importance Weighting: As above, IDF scores computed on a reference corpus are used to weight the words. The recall for the sentence is the IDF-weighted average of the similarity scores of all reference tokens Xi.

b) Without Importance Weighting: The recall for the sentence is the plain average of the similarity scores of all reference tokens Xi.

iii) Calculate F1 Score: the F1 score is the harmonic mean of the precision and recall calculated above.
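Here is a minimal sketch of the greedy-matching step without IDF weighting. The tensors X and X_hat stand in for the contextual embeddings of the reference and candidate tokens; they are random here purely for illustration:

import torch

X = torch.randn(11, 768)        # reference: 11 token embeddings
X_hat = torch.randn(12, 768)    # candidate: 12 token embeddings

# normalize so that the dot product equals cosine similarity
X = X / X.norm(dim=-1, keepdim=True)
X_hat = X_hat / X_hat.norm(dim=-1, keepdim=True)

sim = X @ X_hat.T               # (11, 12) pairwise cosine similarities

recall = sim.max(dim=1).values.mean()     # each reference token matched to its best candidate token
precision = sim.max(dim=0).values.mean()  # each candidate token matched to its best reference token
f1 = 2 * precision * recall / (precision + recall)
print(precision.item(), recall.item(), f1.item())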


Code

Base Code

def score(
    cands,
    refs,
    model_type=None,
    num_layers=None,
    verbose=False,
    idf=False,
    device=None,
    batch_size=64,
    nthreads=4,
    all_layers=False,
    lang=None,
    return_hash=False,
    rescale_with_baseline=False,
    baseline_path=None,
):
    """
    BERTScore metric.
    Args:
        - :param: `cands` (list of str): candidate sentences
        - :param: `refs` (list of str or list of list of str): reference sentences
        - :param: `model_type` (str): bert specification, default using the suggested
                  model for the target language; has to specify at least one of
                  `model_type` or `lang`
        - :param: `num_layers` (int): the layer of representation to use.
                  default using the number of layers tuned on WMT16 correlation data
        - :param: `verbose` (bool): turn on intermediate status update
        - :param: `idf` (bool or dict): use idf weighting, can also be a precomputed idf_dict
        - :param: `device` (str): on which the contextual embedding model will be allocated on.
                  If this argument is None, the model lives on cuda:0 if cuda is available.
        - :param: `nthreads` (int): number of threads
        - :param: `batch_size` (int): bert score processing batch size
        - :param: `lang` (str): language of the sentences; has to specify
                  at least one of `model_type` or `lang`. `lang` needs to be
                  specified when `rescale_with_baseline` is True.
        - :param: `return_hash` (bool): return hash code of the setting
        - :param: `rescale_with_baseline` (bool): rescale bertscore with pre-computed baseline
        - :param: `baseline_path` (str): customized baseline file
    Return:
        - :param: `(P, R, F)`: each is of shape (N); N = number of input
                  candidate reference pairs. if returning hashcode, the
                  output will be ((P, R, F), hashcode). If a candidate has
                  multiple references, the returned score of this candidate is
                  the *best* score among all references.
    """
    assert len(cands) == len(refs), "Different number of candidates and references"
    assert lang is not None or model_type is not None, "Either lang or model_type should be specified"

    # if each candidate has several references, flatten them and remember the group boundaries
    ref_group_boundaries = None
    if not isinstance(refs[0], str):
        ref_group_boundaries = []
        ori_cands, ori_refs = cands, refs
        cands, refs = [], []
        count = 0
        for cand, ref_group in zip(ori_cands, ori_refs):
            cands += [cand] * len(ref_group)
            refs += ref_group
            ref_group_boundaries.append((count, count + len(ref_group)))
            count += len(ref_group)

    if rescale_with_baseline:
        assert lang is not None, "Need to specify Language when rescaling with baseline"

    if model_type is None:
        lang = lang.lower()
        model_type = lang2model[lang]
    if num_layers is None:
        num_layers = model2layers[model_type]

    tokenizer = get_tokenizer(model_type)
    model = get_model(model_type, num_layers, all_layers)
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    if not idf:
        idf_dict = defaultdict(lambda: 1.0)
        # set idf for [SEP] and [CLS] to 0
        idf_dict[tokenizer.sep_token_id] = 0
        idf_dict[tokenizer.cls_token_id] = 0
    elif isinstance(idf, dict):
        if verbose:
            print("using predefined IDF dict...")
        idf_dict = idf
    else:
        if verbose:
            print("preparing IDF dict...")
        start = time.perf_counter()
        idf_dict = get_idf_dict(refs, tokenizer, nthreads=nthreads)
        if verbose:
            print("done in {:.2f} seconds".format(time.perf_counter() - start))

    if verbose:
        print("calculating scores...")
    start = time.perf_counter()
    all_preds = bert_cos_score_idf(
        model,
        refs,
        cands,
        tokenizer,
        idf_dict,
        verbose=verbose,
        device=device,
        batch_size=batch_size,
        all_layers=all_layers,
    ).cpu()

    # keep the best score among the references of each candidate
    if ref_group_boundaries is not None:
        max_preds = []
        for beg, end in ref_group_boundaries:
            max_preds.append(all_preds[beg:end].max(dim=0)[0])
        all_preds = torch.stack(max_preds, dim=0)

    use_custom_baseline = baseline_path is not None
    if rescale_with_baseline:
        if baseline_path is None:
            baseline_path = os.path.join(os.path.dirname(__file__), f"rescale_baseline/{lang}/{model_type}.tsv")
        if os.path.isfile(baseline_path):
            if not all_layers:
                baselines = torch.from_numpy(pd.read_csv(baseline_path).iloc[num_layers].to_numpy())[1:].float()
            else:
                baselines = torch.from_numpy(pd.read_csv(baseline_path).to_numpy())[:, 1:].unsqueeze(1).float()
            all_preds = (all_preds - baselines) / (1 - baselines)
        else:
            print(
                f"Warning: Baseline not Found for {model_type} on {lang} at {baseline_path}", file=sys.stderr,
            )

    out = all_preds[..., 0], all_preds[..., 1], all_preds[..., 2]  # P, R, F

    if verbose:
        time_diff = time.perf_counter() - start
        print(f"done in {time_diff:.2f} seconds, {len(refs) / time_diff:.2f} sentences/sec")

    if return_hash:
        return tuple(
            [
                out,
                get_hash(model_type, num_layers, idf, rescale_with_baseline, use_custom_baseline=use_custom_baseline,),
            ]
        )

    return out
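For a single sentence pair, the same function can be called directly. A minimal sketch using the running example (the resulting numbers depend on the underlying model and settings):

from bert_score import score

cands = ["A man is sleeping on a green room on a couch ."]
refs = [["A man sleeping in a green room on a couch ."]]   # one list of references per candidate

# returns one precision/recall/F1 value per candidate, as tensors
P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(P.item(), R.item(), F1.item())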

Implementation

import logging

from bert_score import score

# silence tokenizer warnings from HuggingFace transformers
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)


def calculate_bert_score(model):
    trgs = []
    pred_trgs = []
    data = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

    for datum in data:
        src, trg = datum

        pred_trg = translate_sentence(model, src)

        # cut off <eos> token
        pred_trg = pred_trg[:-1]

        pred_trgs.append(pred_trg.strip())
        trgs.append([trg.strip()])

    P, R, F1 = score(pred_trgs, trgs, lang="en", verbose=False, batch_size=BATCH_SIZE)
    P, R, F1 = P.mean(), R.mean(), F1.mean()

    return P, R, F1

Training Logs of BLEU Score, Perplexity and BERT Score with Explanation

Training Logs

Epoch: 1:- Train loss: 5.321 | Val PPL: 61.466 | Val BLEU Score: 0.036 | Val BERT Score: Precision - 0.844, Recall - 0.871, F1 Score - 0.857 || Epoch time = 41.566s
Epoch: 2:- Train loss: 3.768 | Val PPL: 28.040 | Val BLEU Score: 0.107 | Val BERT Score: Precision - 0.882, Recall - 0.896, F1 Score - 0.889 || Epoch time = 44.592s
Epoch: 3:- Train loss: 3.163 | Val PPL: 18.227 | Val BLEU Score: 0.166 | Val BERT Score: Precision - 0.898, Recall - 0.908, F1 Score - 0.903 || Epoch time = 43.619s
Epoch: 4:- Train loss: 2.771 | Val PPL: 13.789 | Val BLEU Score: 0.204 | Val BERT Score: Precision - 0.906, Recall - 0.917, F1 Score - 0.911 || Epoch time = 44.338s
Epoch: 5:- Train loss: 2.481 | Val PPL: 11.615 | Val BLEU Score: 0.234 | Val BERT Score: Precision - 0.913, Recall - 0.924, F1 Score - 0.918 || Epoch time = 44.738s

Explanation

Note: Reference sentence means Target/Human annotated text. Candidate Sentence means Model generated text.

1) Perplexity

Epoch: 1:- Train loss: 5.321 | Val PPL: 61.466 || Epoch time = 41.566s
Epoch: 2:- Train loss: 3.768 | Val PPL: 28.040 || Epoch time = 44.592s
Epoch: 3:- Train loss: 3.163 | Val PPL: 18.227 || Epoch time = 43.619s
Epoch: 4:- Train loss: 2.771 | Val PPL: 13.789 || Epoch time = 44.338s
Epoch: 5:- Train loss: 2.481 | Val PPL: 11.615 || Epoch time = 44.738s

We can see a decreasing perplexity trend, which indicates that the model is learning and becoming less confused. Perplexity is the exponentiation of entropy: the lower the perplexity, the lower the entropy (randomness) of the model's predictive distribution, and thus the better the model is at predicting.

2) BLEU Score

Epoch: 1:- Train loss: 5.321 | Val BLEU Score: 0.036 || Epoch time = 41.566s
Epoch: 2:- Train loss: 3.768 | Val BLEU Score: 0.107 || Epoch time = 44.592s
Epoch: 3:- Train loss: 3.163 | Val BLEU Score: 0.166 || Epoch time = 43.619s
Epoch: 4:- Train loss: 2.771 | Val BLEU Score: 0.204 || Epoch time = 44.338s
Epoch: 5:- Train loss: 2.481 | Val BLEU Score: 0.234 || Epoch time = 44.738s

We can see an increasing BLEU score trend. The BLEU score is the product of the brevity penalty (BP), which penalizes candidates shorter than the effective reference length, and the exponential of the weighted sum of log modified precisions, i.e., the clipped counts of 1-grams, 2-grams, 3-grams and 4-grams present in both the reference and candidate sentences. As the BLEU score increases, BP and the modified precisions increase. A higher BP means the candidate translations match the reference translations more closely in length, and higher modified precision means more of the candidate's 1-grams, 2-grams, 3-grams and 4-grams occur in the reference sentences, i.e., the candidate matches the references in word choice and word order. Thus, the predicted and target sentences are becoming more similar, and the model is improving.

3) BERT Score

Epoch: 1:- Train loss: 5.321 | Val BERT Score: Precision - 0.844, Recall - 0.871, F1 Score - 0.857 || Epoch time = 41.566s
Epoch: 2:- Train loss: 3.768 | Val BERT Score: Precision - 0.882, Recall - 0.896, F1 Score - 0.889 || Epoch time = 44.592s
Epoch: 3:- Train loss: 3.163 | Val BERT Score: Precision - 0.898, Recall - 0.908, F1 Score - 0.903 || Epoch time = 43.619s
Epoch: 4:- Train loss: 2.771 | Val BERT Score: Precision - 0.906, Recall - 0.917, F1 Score - 0.911 || Epoch time = 44.338s
Epoch: 5:- Train loss: 2.481 | Val BERT Score: Precision - 0.913, Recall - 0.924, F1 Score - 0.918 || Epoch time = 44.738s

We can see an increasing trend in the precision, recall and F1 components of the BERT Score. BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity to calculate precision, recall and F1 score.

For precision, each token in the candidate sentence is matched by cosine similarity against all the tokens of the reference sentence, and the maximum similarity becomes the precision score for that token. The mean of these token-level scores is the precision for that sentence. Precision over the dataset is calculated in batches, and the mean over batches is the precision reported here. Increasing precision means increasing cosine similarity between candidate and reference tokens, so the predicted sentences contain words that are more similar in nature to the target sentences. Thus, the model is getting better at generating text similar to the target text.

For recall, each token in the reference sentence is matched by cosine similarity against all the tokens of the candidate sentence, and the maximum similarity becomes the recall score for that token. The mean of these token-level scores is the recall for that sentence. Recall over the dataset is calculated in batches, and the mean over batches is the recall reported here. Increasing recall means the candidate sentences are covering more of the content of the reference sentences, so the model is getting better at recovering the context of the target text.

For the F1 score, the harmonic mean of precision and recall is calculated for each batch, and the F1 score reported here is the mean over all batches. Increasing F1 means better precision and recall together: the model is not only generating text similar to the target text but also covering more of its content.

Comparison of Evaluation Metrics (BLEU, Perplexity and BERT Score)

Contacts

If you want to keep up with my latest articles and projects, follow me on Medium, YouTube and GitHub.

References

  1. https://github.com/Tiiiger/bert_score/blob/master/bert_score/score.py
  2. https://pytorch.org/text/_modules/torchtext/data/metrics.html

