SEATTLE, July 29, 2021 /PRNewswire/ -- DefinedCrowd, the
one-stop shop for high-quality artificial intelligence training
data, today released the first in a series of free
Spanish-accented English speech datasets that let AI developers
test how well their speech recognition models
understand nonnative English speakers, a group of
more than 35 million people in the United States.
"There is an accent gap in speech technology. Research
shows that speech recognition technologies are not nearly as
accurate in understanding nonnative accents as they are in
understanding white, non-immigrant, upper-middle-class
Americans," said Dr. Daniela
Braga, founder and CEO of DefinedCrowd.
This is not a surprising phenomenon, since this is the demographic that
had access to the technology and trained it from the beginning. To
address the bias present in speech recognition
technology, DefinedCrowd has released the first
of four Spanish-accented English speech datasets,
which developers can use to test or benchmark their models and
identify bias and areas that need more training data.
"Unfortunately, it has resulted in models that are more useful
to some people than to others. And that must change," said Dr.
Braga.
However, many companies do not have the resources to train or
test their systems with different accents, meaning that
speech recognition systems are likely to provide an
unresponsive, inaccurate, and even isolating experience to
nonnative English speakers.
This is clearly bad for business: according to the U.S. Census,
over 35 million people in the United
States are native speakers of a language other than English.
Sixty percent of these people speak Spanish at home.
"For companies with AI solutions to compete in the large
nonnative English-speaking market in the U.S., speech models need
to be able to understand a wide range of different Spanish accents,
originating from all the Americas," said Christopher Shulby,
Director of Machine Learning
Engineering at DefinedCrowd.
The first dataset, released in two
phases, contains Spanish-accented English data from the
Americas, including Argentina, Brazil, Canada, Chile, Colombia,
the Dominican Republic, Guatemala,
Honduras, Mexico, Nicaragua, Panama, Peru,
the United States, Uruguay, and Venezuela.
Subsequent releases will include datasets from native
Spanish speakers around the world,
including Australia, China, Finland, France, Germany, India,
Israel, Italy, Norway, Portugal, Russia, Spain, Sweden, and the
United Kingdom.
The datasets represent speakers aged 18 to 40, with an equal
distribution of male and female speakers.
To access the data, developers will need to register
on DefinedCrowd's Marketplace, after which they will
receive a link to download the dataset.
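As a rough illustration of the kind of benchmarking described above, the sketch below computes word error rate (WER) separately for each accent group, so that groups where a model performs worse stand out. It is a minimal example under stated assumptions, not DefinedCrowd's method: the function names, the group labels, and the (group, reference, hypothesis) sample layout are hypothetical and do not reflect the actual format of the released datasets.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic dynamic-programming Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Aggregate WER per group.

    `samples` is an iterable of (group, reference, hypothesis) tuples.
    This layout is an assumption made for illustration only.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical transcripts: human reference vs. model output.
samples = [
    ("mexico", "turn on the lights", "turn on the light"),
    ("mexico", "play some music", "play some music"),
    ("baseline", "turn on the lights", "turn on the lights"),
]
results = wer_by_group(samples)
for group, wer in sorted(results.items()):
    print(f"{group}: WER = {wer:.3f}")
```

A noticeably higher WER for one group than for a baseline group would suggest that the model needs more training data for that accent.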
Contact:
pr@definedcrowd.com
View original content to download multimedia:
https://www.prnewswire.com/news-releases/mind-the-accent-gap-definedcrowd-contributing-to-more-inclusive-speech-technology-301344593.html
SOURCE DefinedCrowd Corp.