Humanizing Word Error Rate for ASR Transcript Readability and Accessibility


Podcasting has grown into a popular and powerful medium for storytelling, news, and entertainment. Without transcripts, podcasts may be inaccessible to people who are hard of hearing, deaf, or deaf-blind. However, ensuring that auto-generated podcast transcripts are readable and accurate is a challenge. The text must accurately reflect the meaning of what was spoken and be easy to read. The Apple Podcasts catalog contains millions of podcast episodes, which we transcribe using automatic speech recognition (ASR) models. To evaluate the quality of our ASR output, we compare a small number of human-generated, or reference, transcripts to corresponding ASR transcripts.

The industry standard for measuring transcript accuracy, word error rate (WER), lacks nuance. It penalizes all errors in the ASR text (insertions, deletions, and substitutions) equally, regardless of their impact on readability. Moreover, the reference text is subjective: it is based on what the human transcriber discerns as they listen to the audio.

Building on recent research into better readability metrics, we set ourselves the challenge of developing a more nuanced quantitative assessment of the readability of ASR passages. As shown in Figure 1, our solution is the human evaluation word error rate (HEWER) metric. HEWER focuses on major errors, those that adversely impact readability, such as misspelled proper nouns, capitalization errors, and certain punctuation errors. HEWER ignores minor errors, such as filler words ("um," "yeah," "like") or alternate spellings ("OK" vs. "okay"). We found that for an American English test set of 800 segments with an average ASR transcript WER of 9.2%, sampled from 61 podcast episodes, the HEWER was just 1.4%, indicating that the ASR transcripts were of higher quality and more readable than WER might suggest.

Figure 1: Word error rate (WER) counts all differing tokens as errors of equal weight, whereas human evaluation word error rate (HEWER) counts only some of those tokens as major errors (errors that change the meaning of the text, affect its readability, or misspell proper nouns) when calculating the metric.

Our findings provide data-driven insights that we hope have laid the groundwork for improving the accessibility of Apple Podcasts for millions of users. In addition, Apple engineering and product teams can use these insights to help connect audiences with more of the content they seek.

Selecting Sample Podcast Segments

We worked with human annotators to identify and classify errors in 800 segments of American English podcasts pulled from manually transcribed episodes with a WER of less than 15%. We chose this WER maximum to ensure the ASR transcripts in our evaluation samples:

  • Met the threshold of quality we expect for any transcript shown to an Apple Podcasts audience
  • Required our annotators to spend no more than five minutes classifying errors as major or minor

Of the 66 podcast episodes in our initial dataset, 61 met this criterion, representing 32 unique podcast shows. Figure 2 shows the selection process.

Figure 2: The phases of the podcast segment selection process.

For example, one episode in the initial dataset from the podcast show Yo, Is This Racist? titled "Cody's Marvel dot Ziglar (with Cody Ziglar)" had a WER of 19.2% and was excluded from our evaluation. But we included an episode titled "I'm Not Trying to Put the Plantation on Blast, But…" from the same show, which had a WER of 14.5%.

Segments with a relatively higher episode WER were weighted more heavily in the selection process, because such episodes can provide more insights than episodes whose ASR transcripts are nearly flawless. The mean episode WER across all segments was 7.5%, while the average WER of the selected segments was 9.2%. Each audio segment was roughly 30 seconds long, providing enough context for annotators to understand the segments without making the task too taxing. We also aimed to select segments that started and ended at a phrase boundary, such as a sentence break or long pause.
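As a rough illustration of this selection step, the sketch below filters out segments from episodes above the 15% WER maximum and then samples the remainder with probability proportional to their episode's WER. The record layout, values, and use of random.choices are assumptions for illustration only, not a description of our actual pipeline.

```python
import random

# Hypothetical (segment_id, episode_wer) records; the values are illustrative.
segments = [
    ("seg_001", 0.192),  # from an episode above the 15% maximum: excluded
    ("seg_002", 0.145),
    ("seg_003", 0.075),
    ("seg_004", 0.031),
]

# Keep only segments from episodes whose ASR transcript WER is below 15%.
eligible = [(seg_id, wer) for seg_id, wer in segments if wer < 0.15]

# Weight candidates by episode WER, so noisier transcripts, which surface
# more annotatable errors, are sampled more often than nearly flawless ones.
ids = [seg_id for seg_id, _ in eligible]
weights = [wer for _, wer in eligible]
print(random.choices(ids, weights=weights, k=2))
```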

Evaluating Major and Minor Errors in Transcript Samples

WER is a widely used measurement of the performance of speech recognition and machine translation systems. It divides the total number of errors in the auto-generated text by the total number of words in the human-generated (reference) text. Unfortunately, WER scoring gives equal weight to all ASR errors (insertions, substitutions, and deletions), which can be misleading. For example, a passage with a high WER may still be readable, and even indistinguishable in semantic content from the reference transcript, depending on the types of errors.
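For reference, a minimal word-level WER implementation along these lines might look like the following sketch: a standard Levenshtein alignment over lowercased, whitespace-separated tokens. This is an illustrative baseline, not the exact scoring tooling used in our evaluation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("quarantine" for "quarantining") out of five reference words: 20% WER.
print(wer("we were quarantining at home", "we were quarantine at home"))  # 0.2
```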

Previous research on readability has centered on subjective and imprecise metrics. For example, in their paper "A Metric for Evaluating Speech Recognizer Output Based on Human-Perception Model," Nobuyasu Itoh and team devised a scoring rubric on a scale of 0 to 5, with 0 being the highest quality. Participants in their experiment were first presented with auto-generated text without the corresponding audio and were asked to judge transcripts based on how easy the transcript was to understand. They then listened to the audio and scored the transcript based on perceived accuracy.

Other readability research (for example, "The Future of Word Error Rate") has, to our knowledge, not been applied across any datasets at scale. To address these limitations, our researchers developed a new metric for measuring readability, HEWER, that builds on the WER scoring system.

The HEWER score provides human-centric insights that account for readability nuances. Figure 3 shows three versions of a 30-second sample segment from transcripts of the April 23, 2021, episode, "The Herd," of the podcast show This American Life.

Figure 3: Depending on the types of errors, a passage with a high WER may still be readable and even indistinguishable from the reference transcript. HEWER provides human-centric insights that account for these readability nuances. The HEWER calculation also takes into account errors not considered by WER, such as punctuation or capitalization errors.

Our dataset comprised 30-second audio segments from a superset of 66 podcast episodes, along with each segment's corresponding reference and model-generated transcripts. Human annotators began by identifying errors in wording, punctuation, or capitalization in the transcripts, and classifying as "major errors" only those errors that:

  • Changed the meaning of the text
  • Affected the readability of the text
  • Misspelled proper nouns

WER and HEWER are calculated based on an alignment of the reference and model-generated text. Figure 3 shows each metric's scoring of the same output. WER counts as errors all words that differ between the reference and model-generated text, but it ignores case and punctuation. HEWER, on the other hand, takes both case and punctuation into account, and therefore the total number of tokens, shown in the denominator, is larger because each punctuation mark counts as a token.
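To make the difference concrete, here is a schematic HEWER calculation that assumes annotators have already labeled each divergent token as a major or minor error. The AlignedToken structure and its field names are illustrative assumptions rather than our actual annotation schema.

```python
from dataclasses import dataclass

@dataclass
class AlignedToken:
    """One token in the alignment of the reference and model-generated text.

    Punctuation marks count as tokens, so the HEWER denominator is larger
    than the WER denominator for the same segment.
    """
    text: str
    is_error: bool = False   # token differs between reference and ASR output
    is_major: bool = False   # annotator judged the error to harm readability or meaning

def hewer(tokens: list[AlignedToken]) -> float:
    """Human evaluation word error rate: major errors divided by all tokens."""
    major = sum(1 for t in tokens if t.is_error and t.is_major)
    return major / len(tokens)

# Toy segment: "till" for "until" is a minor error and is ignored;
# the misspelled proper noun "Antibirals" counts as a major error.
segment = [
    AlignedToken("We"), AlignedToken("waited"),
    AlignedToken("till", is_error=True, is_major=False),
    AlignedToken("the"),
    AlignedToken("Antibirals", is_error=True, is_major=True),
    AlignedToken("segment"), AlignedToken("aired"), AlignedToken("."),
]
print(f"HEWER = {hewer(segment):.1%}")  # 1 major error / 8 tokens = 12.5%
```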

Unlike WER, HEWER ignores minor errors, such as the filler word "uh" that is present only in the reference transcript, or the use of "till" in the model-generated text in place of "until" in the reference transcript. Additionally, HEWER ignores differences in comma placement that do not affect readability or meaning, as well as missing hyphens. The only major errors in the Figure 3 HEWER sample are "quarantine" in place of "quarantining" and "Antibirals" in place of "Antivirals."

In this case, the WER is fairly high, at 9.4%. However, that value gives a false impression of the quality of the model-generated transcript, which is actually quite readable. The HEWER value of 2.2% appears to be a better reflection of the human experience of reading the transcript.

Conclusion

Given the rigidity and limitations of WER, the established industry standard for measuring ASR accuracy, we built on recent research to create HEWER, a more nuanced quantitative assessment of the readability of ASR passages. We applied this new metric to a dataset of sample segments from auto-generated transcripts of podcast episodes to glean insights into transcript readability and to help ensure the highest accessibility and best possible experience for all Apple Podcasts audiences and creators.

Acknowledgments

Many people contributed to this research, including Nilab Hessabi, Sol Kim, Filipe Minho, Issey Masuda Mora, Samir Patel, Alejandro Woodward Riquelme, João Pinto Carrilho Do Rosario, Clara Bonnin Rossello, Tal Singer, Eda Wang, Anne Wootton, Regan Xu, and Phil Zepeda.

Apple Resources

Apple Newsroom. 2024. “Apple Introduces Transcripts for Apple Podcasts.” [link.]

Apple Podcasts. n.d. "Unlimited Topics. Endlessly Engaging." [link.]

External References

Glass, Ira, host. 2021. “The Herd.” This American Life. Podcast 736, April 23, 58:56. [link.]

Hughes, John. 2022. "The Future of Word Error Rate (WER)." Speechmatics. [link.]

Itoh, Nobuyasu, Gakuto Kurata, Ryuki Tachibana, and Masafumi Nishimura. 2015. "A Metric for Evaluating Speech Recognizer Output Based on Human-Perception Model." In 16th Annual Conference of the International Speech Communication Association (Interspeech 2015): Speech Beyond Speech: Towards a Better Understanding of the Most Important Biosignal, 1285–88. [link.]
