Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Chinese). It poses additional difficulties because of Chinese characters (which frequently have different pronunciations in different contexts), the complex prosody, which is essential to conveying the meaning of words, and sometimes the difficulty of getting native speakers to agree on the correct pronunciation of certain phonemes.
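To illustrate the context-dependent pronunciation problem, the following is a minimal sketch of one common way to handle it: greedy longest-match lookup in a word lexicon, falling back to a default reading for each character. The lexicon, the function name and the tone-number pinyin notation are assumptions made for this example and are not taken from any synthesizer described here.

```python
# Toy lexicon: multi-character words with known readings (pinyin with tone numbers).
WORD_READINGS = {
    "银行": ["yin2", "hang2"],   # "bank": 行 is read hang2
    "行人": ["xing2", "ren2"],   # "pedestrian": 行 is read xing2
}

# Fallback reading for each character when no word matches (illustrative only).
DEFAULT_READINGS = {
    "银": "yin2",
    "行": "xing2",
    "人": "ren2",
    "去": "qu4",
}

MAX_WORD_LEN = max(len(word) for word in WORD_READINGS)


def to_pinyin(text):
    """Greedy longest-match lookup, falling back to per-character defaults."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, down to two characters.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            word = text[i:i + length]
            if word in WORD_READINGS:
                result.extend(WORD_READINGS[word])
                i += length
                break
        else:
            # No multi-character word starts here; use the default reading.
            result.append(DEFAULT_READINGS.get(text[i], "?"))
            i += 1
    return result
```

Here the character 行 is read differently in 银行 and 行人 only because the surrounding word is found in the lexicon; a real system would need a far larger lexicon and some way of resolving cases that a dictionary alone cannot decide.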
Concatenation (Ekho and KeyTip)
Recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis), and this can severely affect prosody; these synthesizers are also inflexible in terms of speed and expression. However, because they do not rely on a corpus, their performance does not noticeably degrade when they are given unusual or awkward phrases.
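The idea of simple concatenation can be sketched as follows. This is an illustrative example only, not the code of Ekho or KeyTip: the syllable bank, the sample rate and the use of sine tones as stand-ins for real recordings are all assumptions.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def fake_recording(freq_hz, duration_s=0.2):
    """Stand-in for a recorded syllable: a short sine tone as integer samples."""
    n = int(SAMPLE_RATE * duration_s)
    return [
        int(12000 * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
        for t in range(n)
    ]


# Hypothetical syllable bank keyed by pinyin with tone number.
SYLLABLE_BANK = {
    "ni3": fake_recording(220),
    "hao3": fake_recording(330),
}


def concatenate(syllables, bank, gap_s=0.0):
    """Join recordings end to end, optionally with silence between them."""
    gap = [0] * int(SAMPLE_RATE * gap_s)
    out = []
    for index, syllable in enumerate(syllables):
        if index > 0:
            out.extend(gap)
        out.extend(bank[syllable])  # KeyError if the syllable was never recorded
    return out
```

Nothing in this sketch smooths the waveform or pitch at the joins, which is why the output of such a synthesizer tends to sound forced; on the other hand, any sequence of recorded syllables can be produced, however unusual the phrase.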
Ekho is an open-source TTS engine that simply concatenates sampled syllables. It currently supports Cantonese and Mandarin, with experimental support for Korean. Some of the Mandarin syllables have been pitch-normalised in Praat. A modified version of these is used in Gradint's "synthesis from partials".
cjkware.com used to ship a product called KeyTip Putonghua Reader
which worked similarly; it contained 120 Megabytes of sound recordings
(GSM-compressed to 40 Megabytes in the evaluation version), comprising
10,000 multi-syllable dictionary words plus single-syllable recordings
in 6 different prosodies (4 tones, neutral tone, and an extra third-tone
recording for use at the end of a phrase).
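Choosing among six recordings of a syllable might look like the sketch below. It is a guess at the general shape of such a selection step, not KeyTip's actual logic; the key format, the function names and the use of tone number 5 for the neutral tone are assumptions.

```python
def recording_key(syllable, tone, phrase_final):
    """Return a hypothetical key naming which recording variant to play.

    tone: 1-4 for the four tones, 5 for the neutral tone.
    """
    if tone not in (1, 2, 3, 4, 5):
        raise ValueError(f"unknown tone: {tone}")
    if tone == 3 and phrase_final:
        # Separate third-tone recording reserved for the end of a phrase.
        return f"{syllable}3_final"
    return f"{syllable}{tone}"


def keys_for_phrase(phrase):
    """phrase: list of (syllable, tone) pairs forming one phrase."""
    last = len(phrase) - 1
    return [
        recording_key(syllable, tone, index == last)
        for index, (syllable, tone) in enumerate(phrase)
    ]
```

For example, `keys_for_phrase([("ni", 3), ("hao", 3)])` selects the ordinary third-tone recording for the first syllable and the phrase-final one for the second. The sketch does not apply tone sandhi or look up multi-syllable dictionary words, both of which a complete system would need to handle.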
Lightweight synthesizers (eSpeak and Yuet)
The lightweight open-source speech project eSpeak, which has its own approach to synthesis, has experimented with Mandarin and Cantonese. eSpeak was used by Google Translate from May 2010 until December 2010.
The commercial product "Yuet" is also lightweight (it is intended to be suitable for resource-constrained environments such as embedded systems); it has been written from scratch in ANSI C, with development starting in 2013. Yuet is claimed to have a built-in NLP model that does not require a separate dictionary, and its synthesised speech is claimed to have clear word boundaries and emphasis on appropriate words. A copy can be obtained only by contacting its author.
Both eSpeak and Yuet can synthesise speech for Cantonese and Mandarin from the same input text, and can output the corresponding romanisation (for Cantonese, Yuet uses Yale and eSpeak uses Jyutping; both use Pinyin for Mandarin). eSpeak ignores word boundaries when they do not affect which syllable should be spoken.
Corpus-based
A "corpus-based" approach can sound very natural in most cases, but can make errors on unusual phrases that cannot be matched against the corpus. The synthesiser engine is typically very large (hundreds or even thousands of megabytes) because of the size of the corpus.
iFlyTek
Anhui USTC iFlyTek Co., Ltd (iFlyTek) published a W3C paper in which they adapted Speech Synthesis Markup Language
to produce a mark-up language called Chinese Speech Synthesis Markup
Language (CSSML) which can include additional markup to clarify the
pronunciation of characters and to add some prosody information.
The amount of data involved is not disclosed by iFlyTek, but it can be gauged from the commercial products to which iFlyTek has licensed its technology; for example, Bider's SpeechPlus is a 1.3 Gigabyte download, 1.2 Gigabytes of which is used for the highly compressed data for a single Chinese voice. iFlyTek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); iFlyTek describes its English synthesis as "average".
The iFlyTek corpus appears to be heavily dependent on Chinese characters, and it is not possible to synthesize from pinyin
alone. It is sometimes possible by means of CSSML to add pinyin to the
characters to disambiguate between multiple possible pronunciations, but
this does not always work.
NeoSpeech
There is an online interactive demonstration for NeoSpeech speech synthesis, which accepts Chinese characters and also pinyin if it's enclosed in their proprietary "VTML" markup.
Mac OS
Mac OS had Chinese speech synthesizers available up to version 9. They were removed in 10.0 and reinstated in 10.7 (Lion).
Historical corpus-based synthesizers (no longer available)
A corpus-based approach was taken by Tsinghua University in SinoSonic, with the Harbin dialect voice data taking 800 Megabytes. This was planned to be offered as a download, but the link was never activated. References to it can now be found only on the Internet Archive.
Bell Labs' approach, which was demonstrated online in 1997 but subsequently removed, was described in the monograph "Multilingual Text-to-Speech Synthesis: The Bell Labs Approach" (Springer, October 31, 1997, ISBN 978-0-7923-8027-6). Chilin Shih, the former employee who was responsible for the project (and who subsequently worked at the University of Illinois), put some notes about her methods on her website.