Department of Theoretical and Applied Linguistics, School of English Language and Literature, Aristotle University of Thessaloniki.







Anna Sfakianaki

Box of Tricks

Box of Tricks is the final product of the SPECO Project (1999-2002), which was funded by the EU through the INCO-COPERNICUS program (Contract no. 977126). The project’s head was Klara Vicsi (Technical University of Budapest, Hungary), who developed the Hungarian version. Box of Tricks was also developed in three other languages: English by Peter Roach & Anna Sfakianaki (University of Reading, United Kingdom), Swedish by Anne-Marie Oster (Kungl. Tekniska Hogskolan, Sweden) and Slovenian by Zdravko Kacic (University of Maribor, Slovenia). There was also a commercial partner, Peter Barczikay (Robot Control Software, Hungary), who was involved in the programming and is still involved in the marketing and sales of Box of Tricks.


Box of Tricks is a workstation that provides a real-time visual display of acoustic information for children who need assistance with various aspects of speech production. While learning to speak, children with normal hearing follow a product-oriented approach: they discover how to control their speech organs by reference to acoustic speech signals, and in this way they develop the ability to generate all the acoustic effects occurring in speech. Naturally, this process is problematic for speech-impaired people. Traditional speech therapy generally uses a process-oriented approach, in which the speech therapist gives instructions on how to use the speech organs while forming sounds. During normal speech development, however, children never receive instructions on how to move or where to place their speech articulators.

Instead of the process-oriented approach, or to supplement it, Box of Tricks aims to offer a product-oriented one. In speech communication it is not the process of articulation that is important, but the quality of the produced sound by which the information is transmitted to the other person. In Box of Tricks, which was developed for hearing-impaired children, the produced sound is measured and visualised. The user discovers how to control his or her speech organs by comparing the visual patterns (speech pictures) of the normal acoustic speech signal with those of the defective one. Additionally, the acoustic pre-processing of the system uses a special filter bank imitating the filtering characteristics of the inner ear, so the speech picture should be much closer to what is perceived than one produced by a simple bank of traditional filters or by FFT spectra.
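The design of the SPECO filter bank is not described here. As a rough illustration of what "imitating the filtering characteristics of the inner ear" can mean, the sketch below spaces band centre frequencies on the ERB-rate scale (Glasberg and Moore's approximation), one common model of cochlear frequency resolution. The choice of scale, the number of bands, the frequency range and all function names are assumptions made for this example, not details of Box of Tricks.

```python
import math

def hz_to_erb_rate(f_hz):
    """Convert a frequency in Hz to ERB-rate units (Glasberg & Moore approximation)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at centre frequency f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def centre_frequencies(f_low, f_high, n_bands):
    """Return n_bands centre frequencies equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (n_bands - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_bands)]

# Hypothetical settings: 20 bands between 100 Hz and 8 kHz.
bands = centre_frequencies(100.0, 8000.0, 20)
```

Unlike the equally spaced bins of an FFT, these bands are narrow and dense at low frequencies and become wider and sparser towards high frequencies, which is the property an auditory-style display relies on.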

The components of the system

The system consists of two basic parts. The first is a language-independent editor and measuring system, which is used to construct the modules for all SPECO languages and can be adapted to any European language. The second consists of language-dependent speech databases. The participating languages are English, Hungarian, Slovenian and Swedish, so there are four reference speech databases, which the system uses to make a decision about the microphone input.

The Child Speech Database

Each language has two databases: the reference-speaker database and the multi-speaker database. The four language versions are divided into two packages: the fricative and affricate support and the vowel support. In the English version, the fricative and affricate support includes the sibilants s, z, S, Z and the affricates tS and dZ. The vowel support includes the five long vowels i:, 3:, A:, O: and u:, and the six short vowels I, e, Q, {, V and U (symbols are in SAMPA).

The fricative and affricate support was recorded with our reference speaker, Charlie, when he was eight years old, and the vowel support was recorded about a year later. All recordings were carried out in the sound-deadened recording room in the speech lab at the University of Reading, using the special editor incorporated in the SPECO system. Each utterance was recorded three times, and the best one was saved and chosen to appear as the reference example in the exercises. The reference database was segmented using a special application within the SPECO editor. The reference examples were segmented so as to feed the system with information about the normal range of each phoneme and to show the arbitrary limits of each phoneme in the exercise window, to assist the speech therapist and the client in training.

The multi-speaker database contains a portion of the reference material. Thirty-six children aged between 7 and 11 were recorded. Each recording session took approximately 8-12 minutes, depending mostly on how fast the child could read the utterances from the cards. The speakers were selected from three different schools: two are situated in or near Reading and the third is in a suburb of London. It may be worth noting that some children had problems articulating certain fricative and affricate sounds, most commonly [Z] and [dZ], especially in isolation. There were also some articulation problems concerning the sounds [r] and [T]. The multi-speaker database was also segmented, but this time using software (WASP) not incorporated in the editor itself.

Both databases have been used to establish norms which guide the teaching or remediation process. The segmented material was used to construct the fricative and vowel spectra (“spreadlines”, as we call them) and to determine the allowed spectral deviation. The spreadlines are constructed for each language separately and constitute the actual background of the exercise.
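The exact procedure used to derive the spreadlines is not given in this text. One plausible reading is sketched below under stated assumptions: each band of the spreadline is taken as the mean energy across speakers, and the allowed deviation as a multiple of the standard deviation. The function name, the multiplier `k` and the sample values are all hypothetical.

```python
from statistics import mean, stdev

def build_spreadline(spectra, k=2.0):
    """Return per-band (lower, centre, upper) limits from example spectra.

    spectra: list of equal-length lists, one band-energy vector (dB) per speaker.
    k: hypothetical multiplier on the standard deviation; not from the source.
    """
    n_bands = len(spectra[0])
    limits = []
    for b in range(n_bands):
        values = [s[b] for s in spectra]
        centre = mean(values)
        spread = k * stdev(values)
        limits.append((centre - spread, centre, centre + spread))
    return limits

# Made-up band energies (dB) for three speakers and four bands.
examples = [
    [10.0, 20.0, 35.0, 30.0],
    [12.0, 22.0, 37.0, 28.0],
    [14.0, 24.0, 39.0, 32.0],
]
spreadline = build_spreadline(examples, k=1.0)
```

With real data, the input vectors would come from the segmented phoneme tokens of the reference and multi-speaker databases for one language.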

Types of display

The concept of the SPECO system is to visualise speech at a low level of speech processing and to let clients use their high-level information processing ability to work on this. Teaching children how to obtain information from speech pictures is preferable to giving articulation instructions. A detailed examination was carried out to decide what scale of loudness, pitch contour, spectral distribution, etc., gives the most informative visual presentation (speech pictures) of these parameters. How can we draw children’s attention to the areas of maximum energy in the spectrogram? How can we encourage them to use correct loudness and intonation levels? How can children recognise whether their rhythm is appropriate? Generally we use different amusing background drawings to help children find the important parts of the speech pictures. First of all, each phoneme is assigned its own symbolic picture so that the child very quickly finds out which are the significant parts of the screen (Figure 1).

Cochleagrams of English fricatives and affricates

Figure 1 The top picture shows typical cochleagrams of the English fricatives and affricates trained by the system. Each sound corresponds to a particular drawing (bottom picture) so that the client can make the necessary association when looking at the speech picture. For example, the correct production of an [s] (which is symbolised with a snake) would cover most of the eggs with dots.


Some examples of types of speech pictures are the following: energy changing with time (Figure 2); pitch; voiced - unvoiced detection; intonation; spectrum; spectrogram (cochleagram); spectrogram differences.

Energy changing with time

Figure 2
By saying pi pi pi, the child must make the yellow ball jump over the heads of the worms with the appropriate rhythm.
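The "energy changing with time" picture implies some frame-by-frame energy measure. A minimal sketch of one standard way to compute such a contour follows; the frame length, hop size and synthetic test signal are assumptions chosen for the example and say nothing about how Box of Tricks actually measures energy.

```python
import math

def short_time_energy_db(samples, frame_len, hop):
    """Return the energy of each frame in dB (relative to full scale 1.0)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # floor avoids log(0)
    return energies

# Synthetic test signal: silence, then a 200 Hz tone burst, then silence.
rate = 8000
burst = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
signal = [0.0] * 800 + burst + [0.0] * 800
contour = short_time_energy_db(signal, frame_len=400, hop=400)
```

Plotted against time, a sequence such as "pi pi pi" would give one peak per syllable, which is the kind of curve a rhythm exercise can be built on.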

The system is based on up-to-date technology, but we follow the steps of traditional speech therapy in both modules. These are sound preparation and sound development, followed by training in words and automation (meaning the achievement of a reliable production not requiring further instruction).

At the sound preparation stage, children are trained to pay the necessary attention to the screen. They start to familiarise themselves with the way curves form on the screen according to sound energy and the position of the speech organs. It is possible to train the adjustment of different speech parameters: loudness, rhythm, spectrum, pitch, voicing and intonation.

In sound development we start with the forming of individual phonemes. This stage includes working with articulation pictures, isolated pronunciation practice and syllable training. The articulation pictures (Figure 3) show the child the correct position of all the organs (mouth, tongue, teeth, etc.) that play a role in sound production. After being taught the correct articulation, children attempt to produce sustained sounds.

 Articulation picture for /z/

Figure 3 Articulation picture for fricative [z]; the little bell ringing indicates that there must be voicing when producing this sound.

The change of the energy measured in each frequency band is visible on the screen. The form of the distribution lines of the correctly pronounced phoneme is different in each case and characterises the phoneme itself (Figure 4). This exercise has three levels of difficulty. At the easiest level the stripe (distribution lines) is wider and the deviation is expected to be considerable, whereas at the most difficult level the stripe becomes narrower and the deviation should be small. The three levels of difficulty exist in both the fricative and the vowel support, but in vowel support the differences among the levels are not as substantial.

Spectrum of English /S/

Figure 4 The spectrum of the English fricative [S] presented as a speech picture by the program. The objective is to produce and sustain a line within the limits of the “green field”.
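The three difficulty levels can be pictured as a stripe of varying width around a reference spectrum. The sketch below is one hypothetical way to score a production against such a stripe; the width factors, the tolerance value and the band energies are invented for illustration and are not taken from the program.

```python
# Hypothetical width factors: the easiest level tolerates the largest deviation.
LEVEL_WIDTH = {1: 1.5, 2: 1.0, 3: 0.5}

def fraction_inside(spectrum, reference, tolerance_db, level):
    """Fraction of bands whose energy lies within the stripe around the reference."""
    half_width = tolerance_db * LEVEL_WIDTH[level]
    inside = sum(
        1 for value, ref in zip(spectrum, reference)
        if abs(value - ref) <= half_width
    )
    return inside / len(reference)

reference = [10.0, 20.0, 35.0, 30.0]   # made-up reference band energies (dB)
produced = [11.0, 23.0, 34.0, 36.0]    # made-up client production
scores = {
    level: fraction_inside(produced, reference, 4.0, level)
    for level in (1, 2, 3)
}
```

The same production lies entirely inside the stripe at the easiest level but only partly inside it at the harder levels, mirroring how the narrower stripe demands a smaller deviation.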

For syllable training, the vocabulary contains sound sequences constructed so that the phonemes being practised occur in different positions and contexts. These syllables appear on the screen in the form of cochleagrams. For the English fricative and affricate support, fricatives and affricates are presented in CV, VCV, VC and VC-VC-VC position and combined with the five long vowels. The English vowel support, by contrast, contains all vowels in syllables along with front stops such as [p], [t] and [b]. The order of presentation of sound sequences could be important, so we grade them from the easier pronunciations to the more difficult ones. In this exercise, the reference syllable is shown in the upper half of the screen, while the syllable the client produces appears in the bottom half (Figure 5). The client attempts to match his or her picture with the reference one as closely as possible.
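To make the four syllable contexts concrete, the sketch below enumerates them for the English fricative and affricate support, using the SAMPA symbols listed earlier. It assumes that VCV items use the same vowel on both sides and makes no attempt to reproduce the graded ordering, neither of which is specified in the text; the function and variable names are hypothetical.

```python
CONSONANTS = ["s", "z", "S", "Z", "tS", "dZ"]      # SAMPA, from the text
LONG_VOWELS = ["i:", "3:", "A:", "O:", "u:"]       # SAMPA, from the text

def syllable_items(consonant, vowel):
    """Return the four context patterns for one consonant-vowel pairing."""
    vc = vowel + consonant
    return {
        "CV": consonant + vowel,
        "VCV": vowel + consonant + vowel,   # assumes the same vowel on both sides
        "VC": vc,
        "VC-VC-VC": "-".join([vc] * 3),
    }

vocabulary = {
    (c, v): syllable_items(c, v) for c in CONSONANTS for v in LONG_VOWELS
}
```

For example, `vocabulary[("s", "3:")]["VC"]` gives `3:s`, the reference syllable shown in Figure 5.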

 Cochleagram of syllable [3:s]

Figure 5 The reference syllable [3:s] appears on the top half of the screen and the client’s production below. The blue dots correspond to the vowel and the red dots to the fricative. The aim is to cover as much of the eggs as possible with red dots and leave the snake uncovered.

In word training, the grouping of words differs between the fricative support and the vowel support. In the fricative support, all phonemes are presented in initial, medial and final position in words. In the vowel support, all phonemes occur in one-syllable words and in words of two or more syllables. Again the upper half of the screen shows the cochleagram of the reference word, and the client has to produce the same word so as to fill in the right parts of the bottom half of the screen (Figure 6).

 Cochleagram of the word 'kitchen'

Figure 6 The reference speech picture (cochleagram of the English word ‘kitchen’) above and the client’s production below. The phoneme trained here is [tS] word-medially, and its symbolic picture is a station (for the closure), a train (for the lower part of the cochleagram) and its smoke (for the actual release). The objective is to cover most of the smoke (release of [tS]) with red dots and leave the station and the train clear.


The automation stage (or “continuity” in the English version of the system) consists of two parts: contrast pairs and phrases. These exercises work on the basis of cochleagrams as well. The contrast pairs are presented to the child to show the differences between the speech pictures of two phonemes in similar words. For example, one of the word pairs chosen to train the phoneme /z/ word-initially is “zip-dip”. The phrases contain the trained phoneme at least once; they are specially designed and graded from simple and short to complex and longer ones.

Our aim in the therapy is to reach that speech level at which the client speaks correctly without having to concentrate on the articulation. Therefore, besides the practice with phrases, we have included a category called “free exercise”. The therapist produces the reference example and the client attempts to produce the same example correctly. The example can be a syllable, a word, a pair or a sentence according to the client’s level, wishes and needs.

Gradually the clients learn how to interpret their spreadlines or the dots in the cochleagrams and can easily compare their productions with the model one. Until that happens, or for very young children who cannot readily compare the two screens, Box of Tricks gives another type of feedback which can be easily understood. This automatic feedback is placed under the cochleagrams and can take different forms. Every form has five stages in order to show any subtle improvement or deterioration in each production. The feedback can take the form of a duck which moves to the right and lifts its head in joy when the production is correct (Figure 6), a number which changes from 1 to 5 depending on the production (Figure 7), a flower which gradually comes out of its pot as the production improves, or a colour which changes from red to green when the production is correct.
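How the program turns a production into one of the five feedback stages is not described. Purely as an illustration, the sketch below assumes a match score between 0 and 1 is already available and maps it linearly onto stages 1 to 5, then onto a red-to-green colour. The thresholds, the scoring and the function names are hypothetical.

```python
def feedback_stage(score):
    """Map a match score in [0, 1] to a stage from 1 (poor) to 5 (correct)."""
    clamped = min(max(score, 0.0), 1.0)
    return min(int(clamped * 5) + 1, 5)

def stage_colour(stage):
    """Blend from red (stage 1) to green (stage 5) as an (R, G, B) tuple."""
    t = (stage - 1) / 4
    return (round(255 * (1 - t)), round(255 * t), 0)

stages = [feedback_stage(s) for s in (0.0, 0.3, 0.5, 0.79, 1.0)]
```

The same stage number could equally drive the position of the duck or the height of the flower, since every form of feedback shares the five-step scale.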

 Cochleagram of the phrase 'two hoovers'

Figure 7 The automatic feedback can be changed through ‘Settings’ and can take the form of a duck, a number, a flower or a colour.

An additional feature of Box of Tricks which could be very helpful for the speech therapist is the ‘User Management’ tool. The therapist can have a login name and password and create files for all his or her clients. These files can be created during therapy by saving the client’s productions. Thus a database is created which contains the date of the recording, the type of exercise, the exact utterance (which can also be played back with the ‘Say’ button in Figure 8), the mark and the therapist’s comments at that time.

 Client's files

Figure 8 The selected files of a particular client. These files can be edited and can be kept confidential if the therapist so chooses. Thus a database is created and the therapist can easily keep track of the client’s progress.
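The fields saved for each production (date, exercise type, utterance, mark and comments) suggest a simple record structure. The sketch below is a hypothetical model of such a record and of a progress query; the class, the field names, the integer mark and the sample entries are assumptions, not the program's actual file format.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SessionRecord:
    recorded_on: date      # date of the recording
    exercise_type: str     # e.g. "syllable", "word", "phrase"
    utterance: str         # the exact utterance produced
    audio_path: str        # hypothetical location of the saved audio
    mark: int              # therapist's mark (assumed to be an integer)
    comments: str = ""     # therapist's comments at that time

def progress(records, utterance):
    """Return (date, mark) pairs for one utterance, oldest first."""
    matching = [r for r in records if r.utterance == utterance]
    ordered = sorted(matching, key=lambda r: r.recorded_on)
    return [(r.recorded_on, r.mark) for r in ordered]

# Made-up entries for one client.
records = [
    SessionRecord(date(2002, 5, 14), "word", "kitchen", "kitchen_2.wav", 4),
    SessionRecord(date(2002, 5, 7), "word", "kitchen", "kitchen_1.wav", 2, "weak release"),
    SessionRecord(date(2002, 5, 7), "syllable", "3:s", "3s_1.wav", 3),
]
history = progress(records, "kitchen")
```

Sorting by date lets the marks for one utterance be read as a progress curve over successive therapy sessions.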


For more information about the project SPECO and the product Box of Tricks you can visit the official website:

If you wish to order the product you can get information about prices and other features of Box of Tricks from the company’s website:

or email the company manager, Peter Barczikay: