Acquiring high-quality labeled data has been a long-standing challenge for the data labeling industry. The difficulties include:
- Diverse Needs: Every ML project has its own specific requirements.
- Rising Standards: As models get better, they need better-quality data.
- Identifying Proficient Labelers: There is little visibility into the traits that make an effective labeler.
A key question for our team was therefore: how might we improve quality in data labeling, given the different needs of projects, rising standards, and the need to identify proficient labelers?
On Measuring Quality
The starting point for improving quality was figuring out how to measure it. We couldn’t move a needle that didn’t exist yet.
Initially, we approached quality by identifying what it isn’t: inaccurate labels. The goal was to minimize labeling errors as much as possible based on specific use cases.
However, how do we define inaccurate labels? On what basis? The answer was quite simple in hindsight. Every data labeling project comes with a set of instructions from the client, which serves as the source of truth for the labelers working on it.
In other words, a label is wrong when it violates the client's instructions. By comparing the submitted labels against those instructions and recording any mistakes, we could measure how well an individual labeler or an entire project was performing, and see where improvements were needed.
Example: Categorizing Mistakes for Image Annotation
Image annotation involves assigning labels to the pixels of an image, and it spans many different use cases. Based on prior research, mistakes can be divided into the following general categories:
- Misdrawn Annotations: Annotations with poor boundaries, e.g. drawn too tight or too loose.
- Mislabeled Annotations: Annotations with the wrong label, e.g. a cat labeled as a dog.
- Extra Annotations: Unnecessary or additional annotations that don't fit the project instructions.
- Missing Annotations: Annotations that should have been drawn but were not.
Counting these mistakes and tracing them back to individual labelers in image annotation projects gives us a way to measure labeling quality, as sketched below.
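To make this concrete, here is a minimal sketch of how such mistakes could be represented and tallied per labeler. The class, enum, and field names are illustrative assumptions, not our actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class MistakeType(Enum):
    MISDRAWN = "misdrawn"      # poor boundaries (too tight or too loose)
    MISLABELED = "mislabeled"  # wrong class assigned to the annotation
    EXTRA = "extra"            # annotation the instructions do not call for
    MISSING = "missing"        # annotation that should have been drawn but was not


@dataclass
class ReviewedMistake:
    labeler_id: str
    mistake: MistakeType


def count_mistakes_by_labeler(mistakes: list[ReviewedMistake]) -> dict[str, Counter]:
    """Tally each labeler's mistakes by category."""
    counts: dict[str, Counter] = {}
    for m in mistakes:
        counts.setdefault(m.labeler_id, Counter())[m.mistake] += 1
    return counts
```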
Introducing the Accuracy Scorecard
This led to the conception of the Accuracy Scorecard. Think of it as a precise ledger where we recorded the mistakes made in image annotation projects, broken down into the categories above, and converted them into a score with a simple formula.
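The original formula isn't reproduced here, so the snippet below is only a hypothetical illustration of how recorded mistake counts might be turned into an accuracy score, namely the share of expected annotations that are free of mistakes. The function name and the formula itself are assumptions for illustration, not the actual scorecard.

```python
def accuracy_score(total_annotations: int,
                   misdrawn: int, mislabeled: int, extra: int, missing: int) -> float:
    """Illustrative accuracy score: share of expected annotations with no recorded mistake.

    NOTE: hypothetical formula for illustration; the actual scorecard formula
    is not reproduced in this post.
    """
    # Missing annotations add to the work that *should* have been done.
    expected = total_annotations + missing
    mistakes = misdrawn + mislabeled + extra + missing
    if expected == 0:
        return 1.0
    return max(0.0, 1.0 - mistakes / expected)
```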
Application of the scorecard gave us a clear view of performance at both the project and individual level.
We were also able to identify areas of improvement for individual data labelers by observing the types of mistakes they made. This extended to evaluating their understanding of the project instructions: for example, many extra or missing annotations typically indicated that a labeler had not fully understood the project requirements (see the sketch below).
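Continuing the earlier sketch (and reusing its `MistakeType` enum and per-labeler counts), a simple, hypothetical check for this pattern could look like the following; the 0.5 threshold is an arbitrary assumption.

```python
from collections import Counter


def flag_instruction_gaps(counts: dict[str, Counter],
                          threshold: float = 0.5) -> list[str]:
    """Flag labelers whose mistakes are mostly extra or missing annotations.

    A high share of extra/missing annotations often points to a gap in
    understanding the instructions rather than imprecise drawing.
    The 0.5 threshold is an illustrative assumption.
    """
    flagged = []
    for labeler_id, c in counts.items():
        total = sum(c.values())
        instruction_related = c[MistakeType.EXTRA] + c[MistakeType.MISSING]
        if total and instruction_related / total >= threshold:
            flagged.append(labeler_id)
    return flagged
```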
Digging deeper, understanding the types of mistakes being made allowed us to perform trend analysis and trace quality issues to their root cause, whether that was a breakdown in tooling, process, or the instructions themselves.
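As a rough illustration of what such trend analysis might look like, the sketch below aggregates a hypothetical mistake log by week and category with pandas; the column names and sample data are made up for the example, and a sudden spike in one category would be a starting point for root-cause analysis.

```python
import pandas as pd

# Hypothetical mistake log: one row per reviewed mistake.
mistakes = pd.DataFrame({
    "reviewed_at": pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-09", "2023-01-16"]),
    "mistake_type": ["misdrawn", "extra", "extra", "missing"],
    "project": ["road-signs", "road-signs", "road-signs", "road-signs"],
})

# Weekly counts per mistake category.
trend = (mistakes
         .groupby([pd.Grouper(key="reviewed_at", freq="W"), "mistake_type"])
         .size()
         .unstack(fill_value=0))
print(trend)
```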
The Result
Implementing the scorecard across multiple projects proved a resounding success. Labelers working on the projects were able to pinpoint the types of mistakes they were making, and those with lower scores began asking for feedback en masse.
As a result, challenging projects whose quality had previously stagnated began to improve slowly but steadily in accuracy over time. The effect rippled across the organization as well: new initiatives such as project training programs and new product features used the scorecard data as a benchmark for A/B testing.
All in all, the scorecard gave the organization an effective tool to rally around and improve quality across our projects. It also serves as a springboard as the project ecosystem in data labeling becomes more complex.