Copyright ©1999 - 2002 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
The Voice Browser Working Group has sought to develop standards to enable access to the web using spoken interaction. The Speech Synthesis Markup Language Specification is part of this set of new markup specifications for voice browsers, and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to provide authors of synthesizable content with a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
This is a working draft of the "Speech Synthesis Markup Language Specification". You are encouraged to subscribe to the public discussion list <[email protected]> and to mail in your comments as soon as possible. To subscribe, send an email to <[email protected]> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). A public archive is available online.
This specification describes markup for generating synthetic speech via a speech synthesizer, and forms part of the proposals for the W3C Speech Interface Framework. This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only).
The previous draft of this specification was published as a Last Call Working Draft in January of 2001. Over the past year the Voice Browser Working Group has not focused its attention on this specification, but it is now ready to make more active and timely progress on it. In the meantime the Working Group has made progress on other specifications, such as the Speech Recognition Grammar Format and the VoiceXML 2.0 specification. These are related to the SSML specification, and in some areas depend on it.
In order to coordinate the advancement of these specifications along the W3C track to Recommendation, the Working Group felt it was necessary to update the SSML specification with the changes needed to support the VoiceXML specification. Because the state of the art of speech synthesis technology has changed during this timeframe, the Working Group also felt it appropriate to release the specification, with a small number of changes, as a Working Draft. The expectation and goal are that the draft following this one can be released as a Last Call Working Draft, once the Working Group has focused sufficient attention on the specification for it to be technically sound in today's world.
Following the publication of the previous draft of this specification, the group received a number of public comments. Those comments have not been addressed in this current Working Draft but will be addressed in the timeframe of the Last Call Working Draft. Commenters who have sent their comments to the public mailing list need not resubmit their comments in order for them to be addressed at that time.
To help the Voice Browser Working Group build an implementation report (as part of advancing the document on the W3C Recommendation Track), you are encouraged to implement this specification and to indicate to W3C which features you have implemented, along with any problems that arose.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Working Drafts as other than "work in progress". A list of current public W3C Working Drafts can be found at http://www.w3.org/TR/.
The W3C Standard is known as the Speech Synthesis Markup Language specification and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].
The Speech Synthesis Markup Language specification is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
There is some variance in the use of technical vocabulary in the
speech synthesis community. The following definitions establish a
common understanding for this document.
Term | Definition
---|---
Voice Browser | A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.
Speech Synthesis | The process of automatic generation of speech output from data input which may include plain text, formatted text or binary objects.
Text-To-Speech | The process of automatic generation of speech output from text or annotated text input.
The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages published December 23, 1999 by the W3C Voice Browser Working Group.
The following items were the key design criteria.
A Text-To-Speech (TTS) system that supports the Speech Synthesis Markup Language will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.
Document creation: A text document provided as input to the TTS system may be produced automatically, by human authoring, or through a combination of these forms. The Speech Synthesis markup language defines the form of the document.
Document processing: The following are the six major processing steps undertaken by a TTS system to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output.
XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.
- Markup support: The "paragraph" and "sentence" elements defined in the TTS markup language explicitly indicate document structures that affect the speech output.
- Non-markup behavior: In documents and parts of documents where these elements are not used, the TTS system is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the TTS system that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.
- Markup support: The "say-as" element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked includes dates, times, numbers, acronyms, current amounts and more. The set covers many of the common constructs that require special treatment across a wide number of languages but is not and cannot be a complete set.
- Non-markup behavior: For text content that is not marked with the "say-as" element the TTS system is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different systems to render the same document differently.
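The following is a brief, hedged sketch (not part of the original requirements text) of how the "say-as" element, described later in this specification, can resolve the "1/2" ambiguity; the renderings in the comments are plausible outputs rather than mandated behavior:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  Cut the onion in <say-as type="number"> 1/2 </say-as>.
  <!-- plausibly "one half" -->
  The meeting is on <say-as type="date:md"> 1/2 </say-as>.
  <!-- plausibly "January second" -->
</speak>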
Text-to-phoneme conversion: Once the system has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English TTS system will often have trouble determining how to speak some non-English-origin names; e.g. "Tlalpachicatl" which has a Mexican/Aztec origin.
- Markup support: The "phoneme" element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The "say-as" element may also be used to indicate that text is a proper name that may allow a TTS system to apply special rules to determine a pronunciation.
- Non-markup behavior: In the absence of a "phoneme" element the TTS system must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most TTS systems are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.
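A minimal sketch of the "phoneme" element mentioned above; the IPA string is an approximation chosen for illustration and is not taken from the specification text:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  I have <phoneme alphabet="ipa" ph="rɛd"> read </phoneme> the book.
  <!-- forces the past-tense ("red") pronunciation -->
</speak>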
Prosody analysis: Prosody is the set of features of
speech output that includes the pitch (also called intonation or
melody), the timing (or rhythm), the pausing, the speaking rate,
the emphasis on words and many other features. Producing human-like
prosody is important for making speech sound natural and for
correctly conveying the meaning of spoken language.
- Markup support: The "emphasis"
element, "break" element and "prosody" element may all be used by document
creators to guide the TTS system in generating
appropriate prosodic features in the speech output.
- Non-markup behavior: In the absence of these elements, TTS systems are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
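A short, hedged sketch combining the three prosody-related elements named above (all are defined later in this specification; the values are illustrative only):

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <emphasis> Wait </emphasis> for the tone. <break time="500ms"/>
  <prosody rate="slow"> Then state your name clearly. </prosody>
</speak>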
Waveform production: The phonemes and prosodic information are used by the TTS system in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.
- Markup support: The TTS markup does not provide explicit controls over the generation of waveforms. The "voice" element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The "audio" element allows for insertion of recorded audio data into the output stream.
There are many classes of document creator that will produce marked-up documents to be spoken by a TTS system. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.
The document creator has no access to information to mark up the text. All processing steps in the TTS system must be performed fully automatically on raw text. The document requires only the containing "speak" element to indicate the content is to be spoken.
When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody and possibly text-to-phoneme conversion.
Some document creators make considerable effort to mark as many details of the document to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
The most advanced document creators may skip the higher-level markup (structure, text normalization, text-to-phoneme conversion, and prosody analysis) and produce low-level TTS markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.
The following are important instances of architectures or designs from which marked-up TTS documents will be generated. The language design is intended to facilitate each of these approaches.
Dialog language: It is a requirement that documents marked up with the Speech Synthesis Markup Language can be included in the dialog description documents to be produced by the Voice Browser Working Group.
Interoperability with Aural CSS: Any HTML processor that is Aural CSS-enabled can produce Speech Synthesis Markup Language. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification (12-May-1998). This use of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
Application-specific style-sheet processing: As mentioned above, there are classes of application that have knowledge of text content to be spoken and this can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style-sheets to perform transformations of existing XML documents to speech synthesis markup. This is equivalent to the use of ACSS with HTML and once again the speech synthesis markup language is the "final form" representation to be passed to the speech synthesis engine. In this context, SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.
The Speech Synthesis Markup Language Specification provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate platforms, however, is beyond the scope of this document.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.
The following elements are defined in this draft specification.
The Speech Synthesis Markup Language is an XML application. The
root element is speak
. xml:lang
is a defined attribute
specifying the language of the root document. The
version
attribute is a required attribute that
indicates the version of the specification to be used for the
document. The version number for this specification is
1.0.
<?xml version="1.0"?> <speak version="1.0" xml:lang="en-US"> ... the body ... </speak>
Following the XML
convention, languages are indicated by an xml:lang
attribute on the enclosing element with the value following [RFC3066] to define language codes. A
language is specified by an RFC 3066 identifier following the
convention of XML 1.0.
[Note: XML 1.0 adopted RFC3066 through Errata as of
2001-02-22].
Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.
xml:lang
is a defined attribute for speak
, paragraph
, sentence
, p
, and s
elements.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> <paragraph>I don't speak Japanese.</paragraph> <paragraph xml:lang="ja">Nihongo-ga wakarimasen.</paragraph> </speak>
The speech output platform largely determines behavior in the
case that a document requires speech output in a language not
supported by the speech output platform. In any case, if a value
for xml:lang
specifying an unsupported language is
encountered, a conforming SSML processor should attempt to continue
processing and should also notify the hosting environment in that
case.
There may be variation across conformant platforms in the
implementation of xml:lang
for different markup
elements (e.g. paragraph
and sentence
elements). A document
author should be aware that intra-sentential language changes may not
be supported on all platforms.
A language change often necessitates a change in the voice. Where the platform does not have the same voice in both the enclosing and enclosed languages it should select a new voice with the inherited voice attributes. Any change in voice will reset the prosodic attributes to the default values for the new voice of the enclosed text. Where the "xml:lang" value is the same as the inherited value there is no need for any changes in the voice or prosody.
All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis and break element should each be rendered in a manner that is appropriate to the current language.
A paragraph
element represents the paragraph
structure in text. A sentence
element represents the
sentence structure in text. A paragraph contains zero or more
sentences.
xml:lang
is a defined attribute on both paragraph
and sentence elements.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> <paragraph> <sentence>This is the first sentence of the paragraph.</sentence> <sentence>Here's another sentence.</sentence> </paragraph> </speak>
For brevity, the markup also supports <p> and <s> as exact equivalents of <paragraph> and <sentence>. (Note: XML requires that the opening and closing elements be identical, so <p> text </paragraph> is not legal.) Also note that <s> means "strike-out" in HTML 4.0 and earlier, and in XHTML-1.0-Transitional, but not in XHTML-1.0-Strict.
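For illustration only, the previous example can be rewritten with the shorthand elements; this rewritten form is a sketch and does not appear in the specification text:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>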
The use of paragraph
and sentence
elements is optional. Where text occurs without an enclosing
paragraph
or sentence
elements the speech
output system should attempt to determine the structure using
language-specific knowledge of the format of plain text.
The say-as
element indicates the type of text
construct contained within the element. This information is used to
help specify the pronunciation of the contained text. Defining a
comprehensive set of text format types is difficult because of the
variety of languages that must be considered and because of the
innate flexibility of written languages. The say-as
element has been specified with a reasonable set of format types.
Text substitution may be utilized for unsupported constructs.
The type
attribute is a required attribute that
indicates the contained text construct. The format is a text type
optionally followed by a colon and a format.
The base set of type values, divided according to broad functionality, is as follows:
acronym
: The contained text is an acronym.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <say-as type="acronym"> DEC </say-as> </speak> Output: "DEC."
spell-out: The characters in the contained text string are pronounced as individual characters.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <say-as type="spell-out"> USA </say-as> </speak> Output: "U, S, A".
number
: contained text contains integers,
fractions, floating points, Roman numerals or some other textual
format that can be interpreted and spoken as a number in the
current language. Format values for numbers are:
ordinal
, where the contained text should be
interpreted as an ordinal. The content may be a digit sequence or
some other textual format that can be interpreted and spoken as an
ordinal in the current language; cardinal
, where
the contained text should be interpreted as a cardinal. The content
may be a digit sequence or some other textual format that can be
interpreted and spoken as a cardinal in the current
language; and digits
, where the contained text
is to be read as a digit sequence, rather than as a number.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> Rocky <say-as type="number"> XIII </say-as> </speak> Output: "Rocky thirteen." <?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> Pope John the <say-as type="number:ordinal"> VI </say-as> <!-- Pope John the sixth --> Deliver to <say-as type="number:digits"> 123 </say-as> Brookwood. </speak> Output: "Deliver to one two three Brookwood."
date
: contained text is a date. Format values for
date input content are:
"dmy", "mdy", "ymd" (day, month , year), (month, day, year), (year, month, day)
"ym", "my", "md" (year, month), (month, year), (month, day)
"y", "m", "d" (year), (month), (day).
time
: contained text is a time of day. Format
values for time input content are:
"hms", "hm", "h" (hours, minutes, seconds), (hours, minutes), (hours).
duration
: contained text is a temporal duration.
Format values for duration input content are:
"hms", "hm", "ms", "h", "m", "s" (hours, minutes, seconds), (hours, minutes), (minutes, seconds), (hours), (minutes), (seconds).
currency
: contained text is a currency amount.
measure
: contained text is a measurement.
telephone
: contained text is a telephone
number.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <say-as type="date:ymd"> 2000/1/20 </say-as> <!-- January 20th two thousand --> Proposals are due in <say-as type="date:my"> 5/2001 </say-as> <!-- Proposals are due in May two thousand and one --> The total is <say-as type="currency">$20.45</say-as> <!-- The total is twenty dollars and forty-five cents --> </speak>
name
: contained text is a proper name of a person,
company etc.
net
: contained text is an internet identifier.
Format values for internet identifier input content are:
"email", "uri".
address
: contained text is a postal address.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <say-as type="net:email"> [email protected] </say-as> </speak>
When specified, format values of say-as
attributes
are to be interpreted by the conforming SSML processor as hints
provided by the mark-up document author to aid text normalization
and pronunciation.
In all cases, the text enclosed by any say-as
element is intended to be a standard, orthographic form of the
language currently in context. An SSML processor should be able to
support the common, orthographic forms of the specified language.
In the case of dates for example, <say-as type="date">
2000/1/20 </say-as> may be read as "January twentieth two
thousand" or as "the twentieth of January two thousand" and so
on.
When character(s) designating currency units are included in the enclosed text, the SSML processor should include the units in the rendered output.
When multi-field quantities are specified in the format value attribute ("dmy", "my", etc.), the processor may assume that the fields are separated by a single, non-alphanumeric character. The resulting orthographic form may be language-specific, e.g. using a slash to delimit year, month and day in English.
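As a hedged illustration of this convention (a sketch, not taken from the specification text), a dmy date delimited with periods rather than slashes might be written as the following fragment:

<say-as type="date:dmy"> 20.1.2000 </say-as>
<!-- plausibly "the twentieth of January two thousand" -->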
The phoneme
element provides a phonetic
pronunciation for the contained text. The
"phoneme" element may be empty. However, it is
recommended that the element contain human-readable text that can
be used for non-spoken rendering of the document. For example, the
content may be displayed visually for users with hearing
impairments.
The ph
attribute is a required attribute that
specifies the phoneme string.
The alphabet
attribute is an optional attribute
that specifies the phonetic alphabet. The default value of
alphabet
for a conforming SSML processor is
"ipa", corresponding to characters composing the
International Phonetic Alphabet. In addition to an exhaustive set
of vowel and consonant symbols, IPA supports a syllable delimiter,
numerous diacritics, stress symbols, lexical tone symbols,
intonational markers and more.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <phoneme alphabet="ipa" ph="tɒmɑtoʊ"> tomato </phoneme> <!-- This is an example of IPA using character entities --> </speak>
If a value for alphabet
specifying an unknown
phonetic alphabet is encountered, a conforming SSML processor
should continue processing and should notify the hosting
environment in that case.
Characters composing many of the International Phonetic Alphabet (IPA) phonemes are known to display improperly on most platforms. Additional IPA limitations include the fact that IPA is difficult to understand even when using ASCII equivalents, IPA is missing symbols required for many of the world's languages, and IPA editors and fonts containing IPA characters are not widely available. The Voice Browser Working Group will address the issue of specifying a more robust phoneme alphabet at a later date.
Entity definitions may be used for repeated pronunciations. For example:
<?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd" [ <!ENTITY uk_tomato "tɒmɑtoʊ"> ]> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> ... you say <phoneme ph="&uk_tomato;"> tomato </phoneme> I say... </speak>
The sub
element is employed to indicate that the
specified text replaces the contained text for pronunciation. This
allows a document to contain both a spoken and written form. The
required alias
attribute specifies the string to be
substituted for the enclosed string.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <sub alias="World Wide Web Consortium"> W3C </sub> <!-- World Wide Web Consortium --> </speak>
The "voice" element is a production element that requests a change in speaking voice. Attributes are:
xml:lang
: optional language specification
attribute.
gender
: optional attribute indicating the preferred
gender of the voice to speak the contained text. Enumerated values
are: "male", "female", "neutral".
age
: optional attribute indicating the preferred
age of the voice to speak the contained text. Acceptable values are
of type integer.
variant
: optional attribute indicating a preferred
variant of the other voice characteristics to speak the contained
text. (e.g. the second or next male child voice). Valid values of
variant
are integers.
name
: optional attribute indicating a
platform-specific voice name to speak the contained text. The value
may be a space-separated list of names ordered from top preference
down.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <voice gender="female">Mary had a little lamb,</voice> <!-- now request a different female child's voice --> <voice gender="female" variant="2"> It's fleece was white as snow. </voice> <!-- platform-specific voice selection --> <voice name="Mike">I want to be like Mike.</voice> </speak>
When no available voice exactly matches the attributes specified in the document, or when multiple voices match the criteria, the voice selection algorithm may be platform-specific. In either case, a conforming SSML processor should continue processing and should notify the hosting environment.
Voice attributes are inherited down the tree including to within elements that change the language.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <voice gender="female"> Any female voice here. <voice age="6"> A female child voice here. <paragraph xml:lang="ja"> <!-- A female child voice in Japanese. --> </paragraph> </voice> </voice> </speak>
A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. Volume is the only exception.
The xml:lang
attribute may be used specially to
request usage of a voice with a specific dialect or other variant
of the enclosing language.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <voice xml:lang="en-cockney"> Try a Cockney voice (London area). </voice> <voice xml:lang="en-brooklyn"> Try one with a New York accent. </voice> </speak>
The "emphasis" element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesizer determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:
level
: the level
attribute indicates
the strength of emphasis to be applied. Defined values are
"strong", "moderate",
"none" and "reduced". The default
level is "moderate". The meaning of
"strong" and "moderate" emphasis
is interpreted according to the language being spoken (languages
indicate emphasis using a possible combination of pitch change,
timing changes, loudness and other acoustic differences). The
"reduced" level is effectively the opposite of
emphasizing a word. For example, when the phrase "going to" is
reduced it may be spoken as "gonna". The "none"
level is used to prevent the speech synthesizer from emphasizing
words that it might typically emphasize.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> That is a <emphasis> big </emphasis> car! That is a <emphasis level="strong"> huge </emphasis> bank account! </speak>
The break
element is an empty element that controls
the pausing or other prosodic boundaries between words. The use of
the break element between any pair of words is optional. If the
element is not defined, the speech synthesizer is expected to
automatically determine a break based on the linguistic context. In
practice, the break
element is most often used to
override the typical automatic behavior of a speech synthesizer.
The attributes are:
size
: the size
attribute is an
optional attribute having one of the following relative values:
"none", "small",
"medium" (default value), or
"large". The value "none"
indicates that a normal break boundary should be used. The other
three values indicate increasingly large break boundaries between
words. The larger boundaries are typically accompanied by
pauses.
time
: the time
attribute is an
optional attribute indicating the duration of a pause in seconds or
milliseconds. It follows the "Times" attribute format from the Cascading Style Sheets,
level 2 (CSS2) Specification, e.g. "250ms", "3s".
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> Take a deep breath <break/> then continue. Press 1 or wait for the tone. <break time="3s"/> I didn't hear you! </speak>
Using the size
attribute is generally preferable to
the time
attribute within normal speech. This is
because the speech synthesizer will modify the properties of the
break according to the speaking rate, voice and possibly other
factors. As an example, a fixed 250ms pause (placed with the
time
attribute) sounds much longer in fast speech than
in slow speech.
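A brief sketch of the two styles side by side (assumed values, for illustration only); the size-based break adapts to the speaking rate while the timed break is fixed:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  Please hold <break size="large"/> while I transfer you.
  Please hold <break time="250ms"/> while I transfer you.
</speak>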
The prosody
element permits control of the pitch,
speaking rate and volume of the speech output. The attributes
are:
pitch
: the baseline pitch for the contained text in
Hertz, a relative change or values "high",
"medium", "low",
"default".
contour
: sets the actual pitch contour for the
contained text. The format is outlined below.
range
: the pitch range (variability) for the
contained text in Hertz, a relative change or values
"high", "medium",
"low", "default".
rate
: the speaking rate in words-per-minute for the
contained text, a relative change or values
"fast", "medium",
"slow", "default".
duration
: a value in seconds or milliseconds for
the desired time to take to read the element contents. Follows the
Times attribute format from the Cascading Style Sheets, level
2 (CSS2) Specification, e.g. "250ms", "3s".
volume
: the volume for the contained text in the
range 0.0 to 100.0 (higher values are louder and specifying a value
of zero is equivalent to specifying "silent"), a
relative change or values "silent",
"soft", "medium",
"loud" or "default".
Relative changes for any of the attributes above are specified as floating-point values: "+10", "-5.5", "+15.2%", "-8.0%". For the pitch and range attributes, relative changes in semitones are permitted: "+0.5st", "+5st", "-2st".
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> The price of XYZ is <prosody rate="-10%"> <say-as type="currency">$45</say-as></prosody> </speak>
The pitch contour is defined as a set of targets at specified
intervals in the speech output. The algorithm for interpolating
between the targets is platform-specific. In each pair of the form
(interval,target)
, the first value is a percentage of
the period of the contained text and the second value is the value
of the pitch
attribute (absolute, relative, relative
semitone, or descriptive values are all permitted). Interval values
outside 0% to 100% are ignored. If a value is not defined for 0% or
100% then the nearest pitch target is copied.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/10/synthesis"> <prosody contour="(0%,+20)(10%,+30%)(40%,+10)"> good morning </prosody> </speak>
The duration
attribute takes precedence over the
rate
attribute. The contour
attribute
takes precedence over the pitch
and range
attributes.
All prosodic attribute values are indicative. If a conforming speech synthesizer is unable to accurately render a document as specified (e.g. trying to set the pitch to 1 MHz, or the speaking rate to 1,000,000 words per minute), it will make a best effort to continue processing by imposing a limit on, or a substitute for, the specified, unsupported value.
In some cases, SSML processors may elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units may reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.
The default value of all prosodic attributes is no change. For
example, omitting the rate
attribute means that the
rate is the same within the element as outside.
The descriptive values ("high", "medium" etc.) may be specific to the platform, to user preferences or to the current language and voice. As such, it is generally preferable to use the descriptive values or the relative changes over absolute values.
The audio
element supports the insertion of recorded audio files and the insertion of other
audio formats in conjunction with synthesized speech output. The
audio
element may be empty. If the audio
element is not empty then the contents should be the marked-up text
to be spoken if the audio document is not available. The alternate
content may include text, speech markup, or another
audio
element. The alternate contents may also be used
when rendering the document to non-audible output and for
accessibility. The optional attribute is src
, which is
the URI of a document with an appropriate mime-type.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> <!-- Empty element --> Please say your name after the tone. <audio src="beep.wav"/> <!-- Container element with alternative text --> <audio src="prompt.au">What city do you want to fly from?</audio> <audio src="welcome.wav"> <emphasis>Welcome</emphasis> to the Voice Portal. </audio> </speak>
An audio
element is successfully rendered if:
Deciding which conditions result in the alternative content
being rendered is platform dependent. If the audio
element is not successfully rendered, a conforming SSML processor
should continue processing and should notify the hosting
environment in that case. An SSML processor may determine after
beginning playback of an audio source that it cannot be played in
its entirety. For example, encoding problems, network disruptions,
etc. may occur. The processor may designate this either as
successful or unsuccessful rendering, but it must document this
behavior.
The audio
element is not intended to be a complete
mechanism for synchronizing synthetic speech output with other
audio output or other output media (video etc.). Instead the
audio
element is intended to support the common case
of embedding audio files in voice output. See the SMIL integration
example in Appendix A.
A mark
element is an element that places a marker
into the text/tag sequence. The mark
element that
contains text is used to reference a special sequence of tags and
text, either for internal reference within the SSML document, or
externally by another document. The empty mark
element
can also be used to reference a specific location in the text/tag
sequence, and can additionally be used to insert a marker into an
output stream for asynchronous notification. When audio output of
the TTS document reaches the mark
, the speech
synthesizer issues an event that includes the required
name
attribute of the element. The platform defines
the destination of the event. The mark
element does
not affect the speech output process.
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"> We would like <mark name="congrats">to extend our warmest congratulations</mark> to the members of the Voice Browser Working Group! Go from <mark name="here"/> here, to <mark name="there"/> there! </speak>
When supported by the implementation, requests can be made to
pause and resume at document locations specified by the
mark
values.
A legal Speech Synthesis Markup Language document must have a legal XML Prolog [XML §2.8].
The XML prolog in a synthesis document comprises the XML
declaration and an optional DOCTYPE declaration referencing the
synthesis DTD. It is followed by the root speak
element. The XML prolog may also contain XML comments, processor
instructions and other content permitted by XML in a prolog.
The version number of the XML declaration indicates which
version of XML is being used. The version number of the
speak
element indicates which version of the SSML
specification is being used -- "1.0" for this
specification. The speak
version is a
required attribute.
The speak
element must designate the SSML namespace
using the xmlns attribute [XMLNS]. The
namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis.
If present, the DOCTYPE should reference the standard DOCTYPE and identifier.
The following are two examples of SSML headers:
<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
<?xml version="1.0" encoding="ISO-8859-1"?> <!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd"> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text editor. See the SMIL/SSML integration examples in Appendix A.
Aural style sheets are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.
The fetching and caching behavior of SSML documents is defined by the environment in which the SSML processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.
This section is Normative.
A synthesis document fragment is a Conforming Speech Synthesis Markup Language Fragment if it can become a Conforming Stand-Alone Speech Synthesis Markup Language Document when:
- an XML declaration (<?xml...?>) is included at the top of the document, and
- if the speak element does not already designate the synthesis namespace using the "xmlns" attribute, xmlns="http://www.w3.org/2001/10/synthesis" is added to the element.
to the element.A document is a Conforming Stand-Alone Speech Synthesis Markup Language Document if:
The Speech Synthesis specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.
The SSML namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.
A Speech Synthesis Markup Language processor is a program that can parse and process Speech Synthesis Markup Language documents.
In a Conforming Speech Synthesis Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined within XML 1.0 and XML Namespaces.
A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the semantics defined for each markup element as described by this document.
A Conforming Speech Synthesis Markup Language Processor is required to parse all language declarations successfully.
A Conforming Speech Synthesis Markup Language Processor should inform its hosting environment if it encounters a language that it can not support.
There is no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor.
This document was written with the participation of the members of the W3C Voice Browser Working Group (listed in alphabetical order):
This appendix is Non-Normative.
The following is an example of reading headers of email messages. The paragraph and sentence elements are used to mark the text structure. The say-as element is used to indicate text constructs such as the time and proper name. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.
<?xml version="1.0"?> <speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en"> <paragraph> <sentence>You have 4 new messages.</sentence> <sentence>The first is from <say-as type="name"> Stephanie Williams </say-as> and arrived at <break/> <say-as type="time">3:45pm</say-as>. </sentence> <sentence> The subject is <prosody rate="-20%">ski trip</prosody> </sentence> </paragraph> </speak>
The following example combines audio files and different spoken voices to provide information on a collection of music.
<?xml version="1.0"?> <speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en"> <paragraph> <voice gender="male"> <sentence>Today we preview the latest romantic music from the W3C.</sentence> <sentence>Hear what the Software Reviews said about Tim Lee's newest hit.</sentence> </voice> </paragraph> <paragraph> <voice gender="female"> He sings about issues that touch us all. </voice> </paragraph> <paragraph> <voice gender="male"> Here's a sample. <audio src="http://www.w3c.org/music.wav"/> Would you like to buy it? </voice> </paragraph> </speak>
The SMIL language is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.
File 'greetings.ssml' contains the following:
<?xml version="1.0"?> <speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en"> <sentence> <mark name="greetings"> <emphasis>Greetings</emphasis> from the <sub alias="World Wide Web Consortium">W3C</sub>! </mark> </sentence> </speak>
SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <par>
      <img src="http://w3clogo.gif" region="whole" begin="0s"/>
      <ref src="greetings.ssml#greetings" begin="1s"/>
    </par>
  </body>
</smil>
SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following:
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <seq>
      <img id="logo" src="http://w3clogo.gif" region="whole"
           begin="0s" end="logo.activateEvent"/>
      <ref src="greetings.ssml#greetings"/>
    </seq>
  </body>
</smil>
This appendix is Informative.
The synthesis DTD is located at http://www.w3.org/TR/speech-synthesis/synthesis.dtd.
<?xml version="1.0" encoding="ISO-8859-1"?> <!-- SSML DTD 20020313 Copyright 1998-2002 W3C (MIT, INRIA, Keio), All Rights Reserved. Permission to use, copy, modify and distribute the SSML DTD and its accompanying documentation for any purpose and without fee is hereby granted in perpetuity, provided that the above copyright notice and this paragraph appear in all copies. The copyright holders make no representation about the suitability of the DTD for any purpose. It is provided "as is" without expressed or implied warranty. --> <!ENTITY % duration "CDATA"> <!ENTITY % integer "CDATA"> <!ENTITY % uri "CDATA"> <!ENTITY % audio "#PCDATA | audio "> <!ENTITY % structure "paragraph | p | sentence | s"> <!ENTITY % sentence-elements "break | emphasis | mark | phoneme | prosody | say-as | voice | sub"> <!ENTITY % allowed-within-sentence " %audio; | %sentence-elements; "> <!ENTITY % say-as-types "(acronym|spell-out|currency|measure| name|telephone|address| number|number:ordinal|number:digits|number:cardinal| date|date:dmy|date:mdy|date:ymd| date:ym|date:my|date:md| date:y|date:m|date:d| time|time:hms|time:hm|time:h| duration|duration:hms|duration:hm|duration:ms| duration:h|duration:m|duration:s| net|net:email|net:uri)"> <!ELEMENT speak (%allowed-within-sentence; | %structure;)*> <!ATTLIST speak version NMTOKEN #REQUIRED xml:lang NMTOKEN #IMPLIED xmlns CDATA #REQUIRED xmlns:xsi CDATA #IMPLIED xsi:schemaLocation CDATA #IMPLIED > <!ELEMENT paragraph (%allowed-within-sentence; | sentence | s)*> <!ATTLIST paragraph xml:lang NMTOKEN #IMPLIED > <!ELEMENT sentence (%allowed-within-sentence;)*> <!ATTLIST sentence xml:lang NMTOKEN #IMPLIED > <!ELEMENT p (%allowed-within-sentence; | sentence | s)*> <!ATTLIST p xml:lang NMTOKEN #IMPLIED > <!ELEMENT s (%allowed-within-sentence;)*> <!ATTLIST s xml:lang NMTOKEN #IMPLIED > <!ELEMENT voice (%allowed-within-sentence; | %structure;)*> <!ATTLIST voice xml:lang NMTOKEN #IMPLIED gender (male | female | neutral) #IMPLIED age %integer; #IMPLIED variant %integer; #IMPLIED name CDATA #IMPLIED > <!ELEMENT prosody (%allowed-within-sentence; | %structure;)*> <!ATTLIST prosody pitch CDATA #IMPLIED contour CDATA #IMPLIED range CDATA #IMPLIED rate CDATA #IMPLIED duration %duration; #IMPLIED volume CDATA #IMPLIED > <!ELEMENT audio (%allowed-within-sentence; | %structure;)*> <!ATTLIST audio src %uri; #IMPLIED > <!ELEMENT emphasis (%allowed-within-sentence;)*> <!ATTLIST emphasis level (strong | moderate | none | reduced) "moderate" > <!ELEMENT say-as (#PCDATA)> <!ATTLIST say-as type %say-as-types; #REQUIRED > <!ELEMENT sub (#PCDATA)> <!ATTLIST sub alias CDATA #REQUIRED > <!ELEMENT phoneme (#PCDATA)> <!ATTLIST phoneme ph CDATA #REQUIRED alphabet CDATA "ipa" > <!ELEMENT break EMPTY> <!ATTLIST break size (large | medium | small | none) "medium" time %duration; #IMPLIED > <!ELEMENT mark (%allowed-within-sentence; | %structure;)*> <!ATTLIST mark name ID #REQUIRED >
This appendix is Normative.
The synthesis schema is located at http://www.w3.org/TR/speech-synthesis/synthesis.xsd.
Note: the synthesis schema includes a no-namespace core schema, located at http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments embedded in non-synthesis namespace schemas.
<?xml version="1.0" encoding="ISO-8859-1"?> <xsd:schema targetNamespace="http://www.w3.org/2001/10/synthesis" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.w3.org/2001/10/synthesis" elementFormDefault="qualified"> <xsd:annotation> <xsd:documentation>SSML 1.0 Schema (20020311)</xsd:documentation> </xsd:annotation> <xsd:annotation> <xsd:documentation>Copyright 1998-2002 W3C (MIT, INRIA, Keio), All Rights Reserved. Permission to use, copy, modify and distribute the SSML schema and its accompanying documentation for any purpose and without fee is hereby granted in perpetuity, provided that the above copyright notice and this paragraph appear in all copies. The copyright holders make no representation about the suitability of the schema for any purpose. It is provided "as is" without expressed or implied warranty. </xsd:documentation> </xsd:annotation> <xsd:include schemaLocation="synthesis-core.xsd"/> </xsd:schema>
<?xml version="1.0" encoding="ISO-8859-1"?> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"> <xsd:annotation> <xsd:documentation>SSML 1.0 Core Schema (20020222)</xsd:documentation> </xsd:annotation> <xsd:annotation> <xsd:documentation>Copyright 1998-2002 W3C (MIT, INRIA, Keio), All Rights Reserved. Permission to use, copy, modify and distribute the SSML core schema and its accompanying documentation for any purpose and without fee is hereby granted in perpetuity, provided that the above copyright notice and this paragraph appear in all copies. The copyright holders make no representation about the suitability of the schema for any purpose. It is provided "as is" without expressed or implied warranty.</xsd:documentation> </xsd:annotation> <xsd:annotation> <xsd:documentation>Importing dependent namespaces</xsd:documentation> </xsd:annotation> <xsd:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2001/xml.xsd"/> <xsd:annotation> <xsd:documentation>General Datatypes</xsd:documentation> </xsd:annotation> <xsd:simpleType name="duration"> <xsd:annotation> <xsd:documentation>Duration follows "Times" in CCS specification; e.g. "25ms", "3s"</xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"> <xsd:pattern value="[0-9]+m?s"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="relative.change"> <xsd:annotation> <xsd:documentation>Relative change: e.g. +10, -5.5, +15%, -9.0%</xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"> <xsd:pattern value="[+-][0-9]+(.[0-9]+)?[%]?"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="relative.change.st"> <xsd:annotation> <xsd:documentation>Relative change in semi-tones: e.g. +10st, -5st</xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"> <xsd:pattern value="[+-]?[0-9]+st"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="height.scale"> <xsd:annotation> <xsd:documentation>values for height </xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"> <xsd:enumeration value="high"/> <xsd:enumeration value="medium"/> <xsd:enumeration value="low"/> <xsd:enumeration value="default"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="number.range"> <xsd:annotation> <xsd:documentation>number range: e.g. 0-123, 23343-223333. No constraint that the second number is greater than the first. 
</xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"> <xsd:pattern value="[0-9]+-.[0-9]+"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="speed.scale"> <xsd:annotation> <xsd:documentation>values for speed </xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"> <xsd:enumeration value="fast"/> <xsd:enumeration value="medium"/> <xsd:enumeration value="slow"/> <xsd:enumeration value="default"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="volume.scale"> <xsd:annotation> <xsd:documentation>values for speed </xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"> <xsd:enumeration value="silent"/> <xsd:enumeration value="soft"/> <xsd:enumeration value="medium"/> <xsd:enumeration value="loud"/> <xsd:enumeration value="default"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="float.range1"> <xsd:annotation> <xsd:documentation>0.0 - 100.0 </xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:float"> <xsd:minInclusive value="0.0"/> <xsd:maxInclusive value="100.0"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="Say-as.datatype"> <xsd:annotation> <xsd:documentation>say-as datatypes </xsd:documentation> </xsd:annotation> <xsd:restriction base="xsd:string"> <xsd:enumeration value="acronym"/> <xsd:enumeration value="spell-out"/> <xsd:enumeration value="number"/> <xsd:enumeration value="number:ordinal"/> <xsd:enumeration value="number:digits"/> <xsd:enumeration value="number:cardinal"/> <xsd:enumeration value="date"/> <xsd:enumeration value="date:dmy"/> <xsd:enumeration value="date:mdy"/> <xsd:enumeration value="date:ymd"/> <xsd:enumeration value="date:ym"/> <xsd:enumeration value="date:my"/> <xsd:enumeration value="date:md"/> <xsd:enumeration value="date:y"/> <xsd:enumeration value="date:m"/> <xsd:enumeration value="date:d"/> <xsd:enumeration value="time"/> <xsd:enumeration value="time:hms"/> <xsd:enumeration value="time:hm"/> <xsd:enumeration value="time:h"/> <xsd:enumeration value="duration"/> <xsd:enumeration value="duration:hms"/> <xsd:enumeration value="duration:hm"/> <xsd:enumeration value="duration:ms"/> <xsd:enumeration value="duration:h"/> <xsd:enumeration value="duration:m"/> <xsd:enumeration value="duration:s"/> <xsd:enumeration value="currency"/> <xsd:enumeration value="measure"/> <xsd:enumeration value="name"/> <xsd:enumeration value="net"/> <xsd:enumeration value="net:email"/> <xsd:enumeration value="net:uri"/> <xsd:enumeration value="address"/> <xsd:enumeration value="telephone"/> </xsd:restriction> </xsd:simpleType> <xsd:annotation> <xsd:documentation>General attributes</xsd:documentation> </xsd:annotation> <xsd:annotation> <xsd:documentation>Elements</xsd:documentation> </xsd:annotation> <xsd:element name="aws" abstract="true"> <xsd:annotation> <xsd:documentation>The 'allowed-within-sentence' group uses this abstract element. 
    Elements with aws as their substitution class are then alternatives for 'allowed-within-sentence'.
    </xsd:documentation>
  </xsd:annotation>
</xsd:element>

<xsd:group name="allowed-within-sentence">
  <xsd:choice>
    <xsd:element ref="aws"/>
  </xsd:choice>
</xsd:group>

<xsd:element name="struct" abstract="true"/>

<xsd:group name="structure">
  <xsd:choice>
    <xsd:element ref="struct"/>
  </xsd:choice>
</xsd:group>

<xsd:element name="speak" type="speak"/>
<xsd:complexType name="speak" mixed="true">
  <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <xsd:group ref="allowed-within-sentence"/>
    <xsd:group ref="structure"/>
  </xsd:choice>
  <xsd:attribute name="version" use="required">
    <xsd:simpleType>
      <xsd:restriction base="xsd:NMTOKEN"/>
    </xsd:simpleType>
  </xsd:attribute>
  <xsd:attribute ref="xml:lang"/>
</xsd:complexType>

<xsd:element name="paragraph" type="paragraph" substitutionGroup="struct"/>
<xsd:element name="p" type="paragraph" substitutionGroup="struct"/>
<xsd:complexType name="paragraph" mixed="true">
  <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <xsd:group ref="allowed-within-sentence"/>
    <xsd:element ref="sentence"/>
    <xsd:element ref="s"/>
  </xsd:choice>
  <xsd:attribute ref="xml:lang"/>
</xsd:complexType>

<xsd:element name="sentence" type="sentence" substitutionGroup="struct"/>
<xsd:element name="s" type="sentence" substitutionGroup="struct"/>
<xsd:complexType name="sentence" mixed="true">
  <xsd:sequence minOccurs="0" maxOccurs="unbounded">
    <xsd:group ref="allowed-within-sentence"/>
  </xsd:sequence>
  <xsd:attribute ref="xml:lang"/>
</xsd:complexType>

<xsd:element name="voice" type="voice" substitutionGroup="aws"/>
<xsd:complexType name="voice" mixed="true">
  <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <xsd:group ref="allowed-within-sentence"/>
    <xsd:group ref="structure"/>
  </xsd:choice>
  <xsd:attribute name="gender">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="male"/>
        <xsd:enumeration value="female"/>
        <xsd:enumeration value="neutral"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
  <xsd:attribute name="age" type="xsd:positiveInteger"/>
  <xsd:attribute name="variant" type="xsd:integer"/>
  <xsd:attribute name="name" type="xsd:string"/>
  <xsd:attribute ref="xml:lang"/>
</xsd:complexType>

<xsd:element name="prosody" type="prosody" substitutionGroup="aws"/>
<xsd:complexType name="prosody" mixed="true">
  <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <xsd:group ref="allowed-within-sentence"/>
    <xsd:group ref="structure"/>
  </xsd:choice>
  <xsd:attribute name="pitch">
    <xsd:simpleType>
      <xsd:union memberTypes="xsd:positiveInteger relative.change relative.change.st height.scale"/>
    </xsd:simpleType>
  </xsd:attribute>
  <xsd:attribute name="contour" type="xsd:string"/>
  <xsd:attribute name="range">
    <xsd:simpleType>
      <xsd:union memberTypes="number.range relative.change relative.change.st height.scale"/>
    </xsd:simpleType>
  </xsd:attribute>
  <xsd:attribute name="rate">
    <xsd:simpleType>
      <xsd:union memberTypes="xsd:positiveInteger relative.change speed.scale"/>
    </xsd:simpleType>
  </xsd:attribute>
  <xsd:attribute name="duration" type="duration"/>
  <xsd:attribute name="volume">
    <xsd:simpleType>
      <xsd:union memberTypes="float.range1 relative.change volume.scale"/>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:complexType>

<xsd:element name="audio" type="audio" substitutionGroup="aws"/>
<xsd:complexType name="audio" mixed="true">
  <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <xsd:group ref="allowed-within-sentence"/>
    <xsd:group ref="structure"/>
  </xsd:choice>
  <xsd:attribute name="src" type="xsd:anyURI"/>
</xsd:complexType>

<xsd:element name="emphasis" type="emphasis" substitutionGroup="aws"/>
<xsd:complexType name="emphasis" mixed="true">
  <xsd:sequence minOccurs="0" maxOccurs="unbounded">
    <xsd:group ref="allowed-within-sentence"/>
  </xsd:sequence>
  <xsd:attribute name="level" default="moderate">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="strong"/>
        <xsd:enumeration value="moderate"/>
        <xsd:enumeration value="none"/>
        <xsd:enumeration value="reduced"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:complexType>

<xsd:element name="sub" type="sub" substitutionGroup="aws"/>
<xsd:complexType name="sub">
  <xsd:simpleContent>
    <xsd:extension base="xsd:string">
      <xsd:attribute name="alias" type="xsd:string" use="required"/>
    </xsd:extension>
  </xsd:simpleContent>
</xsd:complexType>

<xsd:element name="say-as" type="say-as" substitutionGroup="aws"/>
<xsd:complexType name="say-as" mixed="true">
  <xsd:attribute name="type" type="Say-as.datatype" use="required"/>
</xsd:complexType>

<xsd:element name="phoneme" type="phoneme" substitutionGroup="aws"/>
<xsd:complexType name="phoneme" mixed="true">
  <xsd:attribute name="ph" type="xsd:string" use="required"/>
  <xsd:attribute name="alphabet" type="xsd:string" default="ipa"/>
</xsd:complexType>

<xsd:element name="break" type="break" substitutionGroup="aws"/>
<xsd:complexType name="break">
  <xsd:attribute name="size" default="medium">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="large"/>
        <xsd:enumeration value="medium"/>
        <xsd:enumeration value="small"/>
        <xsd:enumeration value="none"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
  <xsd:attribute name="time" type="duration"/>
</xsd:complexType>

<xsd:element name="mark" type="mark" substitutionGroup="aws"/>
<xsd:complexType name="mark" mixed="true">
  <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <xsd:group ref="allowed-within-sentence"/>
    <xsd:group ref="structure"/>
  </xsd:choice>
  <xsd:attribute name="name" type="xsd:ID" use="required"/>
</xsd:complexType>

</xsd:schema>
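The following informal example is not part of the schema; its text, names, audio URI and prosody values are invented for illustration (the rate and volume labels assume the scale datatypes defined elsewhere in this specification). It sketches the kind of document the declarations above are intended to accept:

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US">
  <p>
    <s>
      Welcome to the <sub alias="World Wide Web Consortium">W3C</sub> demonstration service.
    </s>
    <s>
      <emphasis level="strong">Please hold</emphasis>
      <break size="medium"/>
      <prosody rate="slow" volume="loud">while your call is transferred.</prosody>
      <mark name="transfer_point"/>
      <audio src="transfer-tone.wav"/>
    </s>
  </p>
</speak>

Here "p" and "s" substitute for the abstract "struct" element, and "sub", "emphasis", "break", "prosody", "mark" and "audio" substitute for the abstract "aws" element, so each appears where the corresponding content group is allowed.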
This appendix is Normative.
SSML requires that a platform support playback of the audio formats specified below.
| Audio Format | Media Type |
| --- | --- |
| Raw (headerless) 8kHz 8-bit mono mu-law [PCM] single channel. (G.711) | audio/basic (from http://www.ietf.org/rfc/rfc1521.txt) |
| Raw (headerless) 8kHz 8-bit mono A-law [PCM] single channel. (G.711) | audio/x-alaw-basic |
| WAV (RIFF header) 8kHz 8-bit mono mu-law [PCM] single channel. | audio/wav |
| WAV (RIFF header) 8kHz 8-bit mono A-law [PCM] single channel. | audio/wav |
The 'audio/basic' MIME type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this MIME type is specified for recording, the mu-law format must be used. For playback with the 'audio/basic' MIME type, platforms must support the mu-law format and may support the 'au' format.
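For illustration only (the prompt text and audio URI are invented for this example), an author might reference an 8kHz 8-bit mono mu-law WAV file, one of the required formats listed above:

<speak version="1.0" xml:lang="en-US">
  <s>
    Please leave a message after the tone.
    <!-- beep.wav: assumed to be WAV (RIFF header) 8kHz 8-bit mono mu-law, which a conforming platform must be able to play -->
    <audio src="http://www.example.com/prompts/beep.wav"/>
  </s>
</speak>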
This appendix is Non-Normative.
The W3C Voice Browser Working Group has applied to the IETF to register a MIME type for the Speech Synthesis Markup Language. The current proposal is to use "application/ssml+xml".
The W3C Voice Browser Working Group has adopted the convention of using the ".ssml" filename suffix for Speech Synthesis Markup Language documents where "speak" is the root element.
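As an illustrative sketch only (the file name and content are invented), a document following these conventions uses "speak" as its root element, might be stored as welcome.ssml, and would be labeled by a server with the proposed "application/ssml+xml" media type:

<?xml version="1.0" encoding="UTF-8"?>
<!-- welcome.ssml: root element is "speak"; served as application/ssml+xml -->
<speak version="1.0" xml:lang="en-US">
  Welcome to the demonstration service.
</speak>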
This appendix is Non-Normative.
The following features are under consideration for versions of the Speech Synthesis Markup Language Specification after version 1.0:
This appendix is Normative.
SSML is an application of XML 1.0 and thus supports Unicode, which defines a standard universal character set.
Additionally, SSML provides a mechanism for precise control of the input and output languages through the "xml:lang" attribute. This facility provides: