Speech synthesis
A text-to-speech system is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody, which is then imposed on the output speech.
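
The front-end stages described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how any particular system works: the abbreviation table, the digit-by-digit number reading, the ARPAbet-style lexicon, the letter-spelling fallback, and the punctuation-to-boundary mapping are all hypothetical stand-ins for the much larger lexicons, rules, and trained models a real front-end would use. The back-end (the synthesizer) is not shown.

```python
import re

# Hypothetical toy tables, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGIT_WORDS = dict(zip("0123456789", [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]))
LEXICON = {  # ARPAbet-style transcriptions (assumed, not authoritative)
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "has": ["HH", "AE", "Z"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}
# Assumed mapping from punctuation to the prosodic unit it closes.
BOUNDARIES = {",": "phrase", ";": "clause",
              ".": "sentence", "!": "sentence", "?": "sentence"}


def normalize(text):
    """Text normalization: expand abbreviations and digits into words.

    Punctuation is kept as separate tokens so that later stages can use it
    to mark prosodic boundaries. Numbers are read digit by digit, which is
    a deliberate simplification.
    """
    tokens = []
    for raw in text.lower().split():
        if raw in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[raw])
            continue
        # Split the token into a word part and trailing punctuation.
        match = re.fullmatch(r"(.*?)([.,;!?]*)", raw)
        word, punct = match.group(1), match.group(2)
        if word and all(ch in DIGIT_WORDS for ch in word):
            tokens.extend(DIGIT_WORDS[ch] for ch in word)
        elif word:
            tokens.append(word)
        tokens.extend(punct)  # one token per punctuation character
    return tokens


def to_phonemes(word):
    """Grapheme-to-phoneme conversion by lexicon lookup.

    Unknown words fall back to their upper-cased letters, which is only a
    placeholder for a real letter-to-sound model.
    """
    return LEXICON.get(word, list(word.upper()))


def front_end(text):
    """Build a symbolic linguistic representation: a list of prosodic
    units, each holding its words, their phonemes, and a boundary label."""
    units, current = [], []

    def close_unit(label):
        if current:
            units.append({
                "words": list(current),
                "phonemes": [to_phonemes(w) for w in current],
                "boundary": label,
            })
            current.clear()

    for token in normalize(text):
        if token in BOUNDARIES:
            close_unit(BOUNDARIES[token])
        else:
            current.append(token)
    close_unit("sentence")  # flush text that lacks final punctuation
    return units


if __name__ == "__main__":
    for unit in front_end("Dr. Smith has 2 cats, no dogs."):
        print(unit)
```

In this sketch the list returned by `front_end` plays the role of the symbolic linguistic representation: a back-end would take the phonemes and boundary labels of each unit as its input and produce the waveform.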