Speech synthesis is the technology of generating artificial speech by mechanical or electronic means. Text-to-speech (TTS) technology, a branch of speech synthesis, converts text, whether generated by the computer itself or supplied as external input, into intelligible and fluent spoken output.

TTS is one of the key technologies for human-computer speech communication. Giving computers a human-like speaking ability is an important and competitive market in today's information industry. Compared with automatic speech recognition (ASR), speech synthesis is relatively more mature and has a wider range of applications.

With the rapid development of the artificial intelligence industry, speech synthesis systems have become more widely used. Beyond basic clarity and intelligibility, users have increasingly high requirements for the naturalness, rhythm, and sound quality of synthesized speech. The quality of the speech corpus is also a key factor in determining the quality of the synthesis.

[Chinese Mandarin Female Corpus] The recorded voice is gentle and warm, in standard Mandarin, and conveys a positive feeling to listeners. All resources are recorded with professional equipment in a professional studio under unchanged conditions, with an SNR of no less than 35 dB. Recordings are mono, with a 48 kHz sampling rate and 16-bit depth, in PCM or WAV format.
Our corpus is drawn from a variety of text types, including news, novels, science and technology, entertainment, and dialogue. Its design is based on comprehensive linguistic data, with the aim of covering all syllabic consonants, types, tones, links, and prosody. The prosodic hierarchy is also annotated.
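
The recording format described above (mono, 48 kHz, 16-bit) can be checked on a delivered WAV file by reading its header. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` and the constants are illustrative assumptions and are not part of the corpus tooling.

```python
import wave

# Expected values taken from the corpus description; names are hypothetical.
EXPECTED_CHANNELS = 1         # mono
EXPECTED_SAMPLE_WIDTH = 2     # bytes per sample, i.e. 16-bit
EXPECTED_SAMPLE_RATE = 48000  # Hz


def check_wav_format(path):
    """Return a list of mismatches; an empty list means the header matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels: {wav.getnchannels()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width (bytes): {wav.getsampwidth()}")
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(f"sample rate (Hz): {wav.getframerate()}")
    return problems
```

This only inspects the file header. It does not measure SNR, clipping, or zero drift, which would require analysing the samples themselves.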



Technical Parameters

  • Content:
    Chinese Mandarin Female Database
  • Source:
    A comprehensive corpus covering syllabic consonants, types, tones, links, and prosody.
  • Time:
    12 hours
  • Average word count:
    16 words
  • Language:
    Mandarin Chinese
  • Speaker:
    Female, 20-30, elegant and optimistic voice
  • Environment:
    Professional recording studio: 1) in line with professional standards; 2) unchanged recording environment and equipment; 3) SNR of no less than 35 dB
  • Equipment:
    Professional recording equipment and software
  • Format:
    Uncompressed PCM or WAV format, 48 kHz sampling rate, 16-bit
  • Tagging content:
    Audio-text proofreading, prosody labeling, and boundary labeling of Chinese initials and finals
  • Annotation format:
    Tagged text is saved as .txt; boundary labeling text is saved as .interval
  • Standards:
    1. Audio files are saved in WAV format at 48 kHz, 16-bit, with unchanged tone color, volume, and speed, and without zero drift or waveform clipping.
    2. The word accuracy of the annotated text is higher than 99.8%.
    3. The proportion of phoneme boundary errors greater than 10 ms is less than 1%; the accuracy of syllable boundaries is higher than 98%.
  • Storage format:
    Audio files: WAV; text annotation files: TXT; boundary annotation files: INTERVAL
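
The phoneme boundary standard listed above (fewer than 1% of boundaries off by more than 10 ms) can be expressed as a simple ratio. The sketch below assumes boundary times are already available as two equal-length lists of seconds, one from a reference labeling and one from the annotation under review. The function name `boundary_error_rate` and this input layout are assumptions made for illustration, since the .interval file layout is not specified here.

```python
def boundary_error_rate(reference, annotated, tolerance_s=0.010):
    """Fraction of boundaries whose deviation from the reference exceeds the tolerance.

    reference, annotated: equal-length sequences of boundary times in seconds.
    tolerance_s: allowed deviation in seconds (10 ms by default).
    """
    if len(reference) != len(annotated):
        raise ValueError("reference and annotated must have the same length")
    if not reference:
        return 0.0
    errors = sum(
        1 for ref, ann in zip(reference, annotated) if abs(ref - ann) > tolerance_s
    )
    return errors / len(reference)


# Hypothetical boundary times in seconds, for illustration only.
reference_times = [0.100, 0.250, 0.400, 0.620]
annotated_times = [0.102, 0.265, 0.401, 0.620]
rate = boundary_error_rate(reference_times, annotated_times)
print(f"boundary error rate: {rate:.2%}")
```

Under this reading, a batch of annotations would meet the standard when the returned rate is below 0.01.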
