NLP systems perform well on standard benchmarks. However, their performance often falls short when they are deployed in real-world systems. A recent study highlighted that 60% of NLP answers are similar to their training data, which suggests that the models may simply be memorising their training set. Many researchers are now testing the robustness of NLP models against a diverse range of challenges, such as adversarial attacks and rule-based data transformations.


In a recent paper, Salesforce, in association with Stanford University, introduced Robustness Gym, a new framework that works as a robustness evaluation toolkit for NLP models. The paper touches on the key challenges associated with assessing NLP systems, such as the paradox of choice, idiomatic lock-in, and workflow fragmentation, and on how Robustness Gym enables practitioners to compare results with just a few clicks.

Challenges faced by the practitioners while evaluating the NLP models.

“We embed Robustness Gym in a new paradigm for continually evaluating models with the ‘contemplate, create and consolidate’ evaluation loop,” said the authors. Explaining the process, the researchers stated that in the ‘contemplate’ stage, Robustness Gym (RG) guides practitioners on what evaluation to run next, whereas in the ‘create’ stage, RG builds slices of the data, each defining a collection of samples for evaluation. Finally, in the ‘consolidate’ stage, it gathers the findings for faster iteration and community sharing.

How Does It Work?

Robustness Gym guides practitioners on how variables such as the structure of the task, the evaluation requirements (for example testing generalisation, bias, or security), and resource constraints should shape an evaluation. Further, the framework supports four standard evaluation paradigms, or idioms: subpopulations, transformations, evaluation sets, and adversarial attacks. RG users can also consolidate their slices into a TestBench, allowing them to collaboratively build benchmarks and track progress.
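To make the idea of slices and a test bench concrete, here is a minimal sketch in plain Python. It does not use the Robustness Gym API; the `Slice` class, the helper functions, the toy model and the toy data are all hypothetical and exist only for illustration. It covers three of the four idioms (an evaluation set, a subpopulation and a transformation) and reports accuracy per slice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]  # (text, gold label); hypothetical toy format


@dataclass
class Slice:
    """A named collection of examples to evaluate on (hypothetical class)."""
    name: str
    idiom: str
    examples: List[Example]


def subpopulation(name: str, data: List[Example],
                  keep: Callable[[str], bool]) -> Slice:
    # Subpopulation idiom: keep only the examples that satisfy a predicate.
    return Slice(name, "subpopulation", [ex for ex in data if keep(ex[0])])


def transformation(name: str, data: List[Example],
                   rewrite: Callable[[str], str]) -> Slice:
    # Transformation idiom: rewrite every input, keep the gold label.
    return Slice(name, "transformation", [(rewrite(t), y) for t, y in data])


def evaluate(model: Callable[[str], int],
             slices: List[Slice]) -> Dict[str, float]:
    # Accuracy of the model on every non-empty slice.
    report = {}
    for s in slices:
        if not s.examples:
            continue
        correct = sum(model(text) == label for text, label in s.examples)
        report[s.name] = correct / len(s.examples)
    return report


def toy_model(text: str) -> int:
    # Deliberately brittle stand-in: case-sensitive keyword match.
    return 1 if "good" in text else 0


# Invented toy data, not taken from the paper.
data = [
    ("good film", 1),
    ("bad film", 0),
    ("a good but very long and slow film", 1),
    ("not good", 0),
]

# A "test bench" here is simply a list of slices evaluated together.
test_bench = [
    Slice("all", "evaluation set", data),
    subpopulation("short", data, lambda t: len(t.split()) <= 2),
    transformation("uppercase", data, str.upper),
]

report = evaluate(toy_model, test_bench)
for name, accuracy in report.items():
    print(f"{name}: {accuracy:.2f}")
```

Even in this toy setting, the per-slice view shows something the aggregate score hides: the keyword model loses accuracy as soon as the inputs are upper-cased.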

To demonstrate this with a real case study, the researchers described how three users with varying levels of expertise evaluated a natural language inference model using Robustness Gym. While the novice user relied on predefined test benches for direct evaluation, the intermediate user created new slices using the SliceBuilders available in the framework and then constructed their own test benches. Finally, the advanced user drew on their expertise to add custom slices. “All of these users can generate a shareable Robustness Report,” added the researchers.

The researchers further validated the ‘contemplate, create and consolidate’ (Triple-C) process in a 3-hour case study with Salesforce’s commercial sentiment modelling team, where the goal was to measure their model’s bias. The researchers wrote, “We tested their system on 172 slices spanning three evaluation idioms, finding performance degradation on 12 slices of up to 18% (create).” Finally, the researchers generated a single test bench and robustness report for the team, summarising the findings. The team said that RG was easy to use and that it plans to integrate it into its workflow.
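The kind of per-slice degradation check described in this case study can be sketched as follows. This is an illustration only: the function name, the threshold and the numbers are invented, and it assumes that degradation is measured as an absolute drop in accuracy relative to a baseline, which the article does not specify.

```python
from typing import Dict


def degraded_slices(baseline: float,
                    slice_scores: Dict[str, float],
                    threshold: float) -> Dict[str, float]:
    """Return slices whose score falls more than `threshold` below the
    baseline, sorted from the largest drop to the smallest.

    Assumption: degradation is the absolute difference in accuracy.
    """
    drops = {name: baseline - score for name, score in slice_scores.items()}
    ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
    return {name: drop for name, drop in ranked if drop > threshold}


# Invented toy scores, not results from the paper.
baseline_accuracy = 0.80
scores = {"slice_a": 0.78, "slice_b": 0.70, "slice_c": 0.81, "slice_d": 0.60}

flagged = degraded_slices(baseline_accuracy, scores, threshold=0.05)
for name, drop in flagged.items():
    print(f"{name}: accuracy drops by {drop:.2f}")
```

A summary like this, computed over every slice in a test bench, is roughly the kind of information a robustness report would present to the team.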

Robustness Report

The paper also noted that Robustness Gym could be used to conduct new research analyses with ease. To validate this, the team performed a study of named entity linking (NEL) systems and a critical analysis of summarisation models. For entity linking, they compared commercial APIs from Microsoft, Google and Amazon with the open-source systems Bootleg, WAT and REL across two benchmark datasets: Wikipedia and AIDA-CoNLL.

The results showed that the commercial systems mentioned above struggled to link rare entities and lagged their academic counterparts by 10%+, while the summarisation models struggled on examples that require abstraction and distillation, degrading by 9%+. Among the commercial systems, Microsoft outperformed the others, while Bootleg displayed the most consistent performance across the various slices.

Wrapping Up

According to the researchers, Robustness Gym has been developed as an evaluation toolkit for NLP models that supports a broad set of evaluation idioms. It can be used for collaboratively building and sharing evaluations and results. To address the challenges practitioners face today, the team has embedded Robustness Gym into the Triple-C evaluation loop. “Our results suggest that Robustness Gym is a promising tool for researchers and practitioners,” concluded the researchers.

Read the entire paper here.

Sejuti currently works as Senior Technology Journalist at Analytics India Magazine (AIM). Reach out at [email protected]

Please click here to read the original article as posted on Analytics India Magazine.
