We are pleased to present an article detailing the recent updates to an ambitious project named SyllabO+.

Started in the summer of 2013, the SyllabO+ project, developed by Pascale Tremblay and her team, particularly Pascale Bédard, then an undergraduate student in language sciences, aims to analyze Quebec French as it is spoken, taking into account speakers’ age, gender, and communication contexts. As part of the project, a large corpus of recordings from 184 Quebec French speakers, later expanded to 225, was compiled. Some participants were recorded in formal settings, others in informal ones. All recordings were transcribed using the International Phonetic Alphabet (IPA) by the team. The transcriptions were then segmented into words, syllables, and phonemes. Phonemes are the smallest sound units in a language that can differentiate meaning between words. For example, the words bath and path both contain the phonemes “a” and “th” (written /æ/ and /θ/ in IPA), but they differ in the phonemes /b/ and /p/.

In 2016, two databases were created: the phoneme and syllable databases, offering an inventory of the syllables and phonemes of spoken Quebec French. These databases allow for in-depth research on the structure and frequency of these language units. They provide information on how frequently certain phonemes and syllables occur, their transition probabilities (e.g., the likelihood that one syllable follows another), and the statistical relationships between them. Considering that few tools exist for the study of spoken Quebec French, SyllabO+ represents a valuable resource for research across multiple fields, including psycholinguistics, phonetics, speech-language pathology, and the cognitive neuroscience of language. One of its major strengths lies in its focus on spontaneous spoken language, offering an authentic snapshot of everyday language use.
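To make the idea of a transition probability concrete, here is a minimal sketch of how the likelihood that one syllable follows another could be estimated from syllabified speech. It is an illustration only, not the procedure used to build SyllabO+. The function name and the toy utterances are made up, and the syllables are written in ordinary spelling rather than IPA to keep the example readable.

```python
from collections import Counter


def transition_probabilities(utterances):
    """Estimate P(next syllable | current syllable) from lists of syllables."""
    pair_counts = Counter()   # how often each (current, next) pair occurs
    first_counts = Counter()  # how often each syllable is followed by anything
    for syllables in utterances:
        for current, following in zip(syllables, syllables[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy, made-up utterances (NOT taken from the SyllabO+ corpus).
toy_utterances = [
    ["bon", "jour"],
    ["bon", "soir"],
    ["bon", "jour", "ma", "dame"],
]

probs = transition_probabilities(toy_utterances)
# "bon" is followed by "jour" in 2 of its 3 occurrences.
print(probs[("bon", "jour")])
```

In this toy sample, "jour" follows "bon" two times out of three, so the estimated transition probability is about 0.67.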

SyllabO+ also stands out for its inclusion of a wide range of indicators from various linguistic domains—phonetics, phonology, lexicon, and morphology—all derived from the same group of speakers. This work was first published in the international journal Behavior Research Methods in 2017. To read the full article, click here. SyllabO+ is freely available online on our website (click here).

The article we present to you today is the second published as part of this project. It pursues three main objectives. First, it describes the expansion of the corpus of spoken Quebec French within the SyllabO+ project, as well as the creation, since 2017, of three new databases: words, lemmas, and morphemes. Second, it presents a study conducted to evaluate the semantic transparency of the words in the corpus. Finally, it explores the implications of these data for researchers in various fields, such as education, linguistics, and speech-language pathology.

The Update Process of the SyllabO+ Project and the Creation of the Unique Words, Lemmas, and Morphemes Databases

As part of the expansion of the SyllabO+ project, the original corpus was enlarged to enable the creation of three new databases: unique words, lemmas, and morphemes. This project was carried out by a team composed of Pascale Tremblay, Noémie Auclair-Ouellet, Pascale Bédard, Patrick Drouin, Alexandra Barbeau-Morrison, and, finally, Alexandra Lavoie, a lab assistant who segmented all the words in the corpus into morphemes. Details regarding the updates to the corpus are presented in Figure 1 below. These new databases are available on our website by clicking here.
Figure 1. Expansion of the SyllabO+ Corpus

The Database of Words and Lemmas

In 2018, two new databases—the word and lemma databases—were added to the SyllabO+ project. The development of these two tools involved two essential processes: tokenization and lemmatization. Tokenization is the process of dividing a text into smaller units, called “tokens,” which are usually words. Each word in the corpus was then entered into a database, along with grammatical information such as its grammatical gender (masculine vs. feminine), number (singular vs. plural), and conjugation markers. In the second phase, lemmatization allowed these words to be converted into their canonical form, or lemma, as one would find in a dictionary. For example, the lemma of the verb ran is run.
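The two steps described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tooling used for SyllabO+: the tokenizer simply splits on non-word characters, and the lemma table `TOY_LEMMAS` is a tiny hand-made dictionary invented for the example.

```python
import re

# Hypothetical, hand-made lemma table for illustration only.
TOY_LEMMAS = {"ran": "run", "cats": "cat", "were": "be"}


def tokenize(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def lemmatize(token, lemmas=TOY_LEMMAS):
    """Return the dictionary form of a token, or the token itself if unknown."""
    return lemmas.get(token, token)


tokens = tokenize("The cats ran home.")
lemmas = [lemmatize(token) for token in tokens]
print(list(zip(tokens, lemmas)))
```

A real lemmatizer would also need part-of-speech information to resolve ambiguous forms, which a simple lookup table cannot do.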

The Morpheme Database
Morphological analysis was carried out using the previously mentioned database of unique words. Each word was segmented into morphemes and then encoded in the database. A morpheme is the smallest unit of meaning in a language. For example, the word unbelievable contains three morphemes: the prefix un- (meaning “not”), the root believe- (“to accept as true”), and the suffix -able (“capable of being”). A distinction is made between inflectional and derivational morphemes. Inflectional morphemes modify a word to indicate gender, number, or tense (such as verb endings), while derivational morphemes are used to create new words from existing ones.

To build the morpheme database, all the words in the corpus were analyzed in terms of their morphological structure. Monosyllabic words such as pain (bread) and chat (cat), as well as grammatical words like determiners, pronouns, prepositions, and adverbs, were excluded from the analysis because they cannot be segmented into morphemes. Each word was classified as either derived or non-derived, and both roots and affixes were identified and segmented to provide a precise description of their internal composition.
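One way to picture an entry of this kind is as a small record listing a word's morphemes and their roles. The sketch below is purely illustrative: the class names, field names, and role labels are assumptions made for the example and do not reflect the actual structure of the SyllabO+ morpheme database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str
    role: str   # assumed labels: "prefix", "root", "suffix" (derivational), "inflection"
    gloss: str


@dataclass(frozen=True)
class WordEntry:
    word: str
    morphemes: tuple

    @property
    def is_derived(self):
        """A word counts as derived if it carries a derivational prefix or suffix."""
        return any(m.role in ("prefix", "suffix") for m in self.morphemes)


unbelievable = WordEntry(
    word="unbelievable",
    morphemes=(
        Morpheme("un-", "prefix", "not"),
        Morpheme("believe", "root", "to accept as true"),
        Morpheme("-able", "suffix", "capable of being"),
    ),
)

cats = WordEntry(
    word="cats",
    morphemes=(
        Morpheme("cat", "root", "cat"),
        Morpheme("-s", "inflection", "plural"),
    ),
)

print(unbelievable.is_derived, cats.is_derived)
```

Here the inflected form "cats" is not classified as derived, because an inflectional ending does not create a new word.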

The Study of Semantic Transparency
In addition to presenting the creation of the three new databases, our article also aimed to describe the process used to assess the semantic transparency of derived words among speakers of Quebec French. As a reminder, a derived word is formed by adding an affix, such as a prefix or suffix, to a stem. Semantic transparency refers to how easily the meaning of a derived word can be inferred from the meanings of its component parts, particularly its root, prefix, or suffix. In other words, a word is considered semantically transparent when the relationship between its morphological structure and its meaning is clear to speakers. For instance, the meaning of to restart is likely easier to guess based on its prefix re- and the root start than the meaning of breakfast, in which the elements break and fast together mean “to stop fasting.”

The study therefore focused on how speakers perceive this transparency, taking into account several factors: the type of affixation used in forming the derived word, and the characteristics of the participants, including their level of French proficiency (French as a first or second language). This research contributed to a richer morphological analysis of spoken Quebec French. Evaluating semantic transparency is essential, as morphology plays a significant role in reading fluency and word comprehension, especially for individuals who experience difficulties with word decoding.

To conduct this evaluation, an online study was carried out with over 400 voluntary participants (see Figure 2 below, which presents the semantic transparency study).
Figure 2. The process of the online semantic transparency survey

The word pairs included in the semantic transparency study were selected based on several criteria. First, to ensure that participants’ judgments were based solely on semantics, the stems of the derived words had to be real words. For example, the pair minuit–nuit (midnight–night) was included, but not midi–di (noon-oon).

Additionally, when words contained multiple affixes, the shortest meaningful unit was used as the stem. Furthermore, the infinitive form was used for all verb stems. For instance, the word relâchement (release) was compared to the stem lâcher (to let go) rather than relâcher (to release).

Greek or Latin stems were also replaced with their French equivalents, provided that the French version was semantically close and had sufficient phonological overlap with the original. This correspondence was confirmed using a prior phonetic transcription. Thus, the word mémorisation (memorization) was compared to mémoire (memory) rather than mémor.
 

Results of the Semantic Transparency Study
The results of the semantic transparency study indicate that most of the derived words examined show moderate to high levels of transparency. However, the data also reveal some variability in participants’ ratings, particularly for words with moderate transparency. This variability may be due to certain aspects of the study itself, such as the instructions given to participants.

Moreover, the results suggest that words with suffixes are more transparent than those with prefixes, and that words containing both a prefix and a suffix are less transparent than those containing only a prefix or only a suffix (see Figure 3 below).
Figure 3. The violin plots illustrate the distribution of transparency scores for words according to the type of affixation (prefixes, suffixes, or both). Wider sections in the middle indicate that many participants gave scores in that range, while narrower sections at the ends show that few participants gave extreme scores. The median (white line in the center of the box plot) represents the middle value of the data.

The results reveal no significant effect of the sociodemographic variables considered in the study on participants’ evaluations of semantic transparency. Although education showed the strongest effect, it is important to note that the sample’s representativeness was limited due to the high overall education level of the participants. While the study indicates that evaluations were generally similar between native French speakers and second-language (L2) French speakers, reading and listening comprehension skills did influence how L2 speakers evaluated semantic transparency. Specifically, transparency ratings between native and L2 speakers were more aligned when L2 speakers had higher reading proficiency in French, rather than higher oral comprehension skills.
 

What can SyllabO+ be used for?
First, the databases provided in SyllabO+ can be used in the field of rehabilitation, particularly in speech-language pathology. When working with individuals who have oral or written language difficulties, speech-language pathologists focus on phonemes, syllables, words, and morphological structure. Having access to a database that includes all of these linguistic units, along with information about their frequency of use, is an invaluable resource. In therapy sessions, clinicians can adjust the frequency (high or low) of the stimuli used with patients. This allows them to tailor the complexity of exercises according to specific therapeutic goals.

Our databases are also useful for learners of French as a second language. Mastering a second language, whether for children or adults, involves developing morphological awareness, that is, the ability to identify and manipulate morphemes. These databases offer new material that teachers can use to design exercises focusing on morphological awareness, starting from the syllable, morpheme, lexeme, or word level, and adapted to the specific features of spoken Quebec French.

The data in SyllabO+ are also valuable to researchers studying language, such as in psycholinguistics or language neuroscience. Research teams can manipulate not only the words themselves but also variables such as word complexity, frequency of use, or length. This allows them to design various language or speech processing tasks, such as word, nonword, or syllable repetition; lexical decision tasks (e.g., “Does this word exist in the language?”); or speech perception analysis at the word, syllable, or phoneme level.

Finally, the semantic transparency study highlights that speakers do not perceive semantic transparency in a uniform way. Researchers can therefore select words from the study with consistent transparency ratings to create controlled psycholinguistic tasks.
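As a rough illustration of that selection step, the sketch below keeps only the items whose ratings vary little across participants. Everything in it is an assumption made for the example: the item labels, the ratings, the rating scale, and the cutoff of 1.0 are invented and are not taken from the study.

```python
from statistics import mean, stdev

# Made-up ratings on a hypothetical 1-7 scale (NOT data from the study).
toy_ratings = {
    "pair_A": [6, 7, 6, 7],
    "pair_B": [2, 6, 4, 7],
    "pair_C": [6, 6, 7, 6],
}


def consistent_items(ratings, max_sd=1.0):
    """Return items whose ratings have a sample standard deviation <= max_sd."""
    return sorted(
        item for item, scores in ratings.items() if stdev(scores) <= max_sd
    )


selected = consistent_items(toy_ratings)
for item in selected:
    print(item, round(mean(toy_ratings[item]), 2))
```

With these toy numbers, pair_B is dropped because its ratings are spread widely, while pair_A and pair_C are kept. The appropriate cutoff would depend on the rating scale and the needs of the task.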

In conclusion, with six databases now available, SyllabO+ is a valuable resource for advancing our understanding of language and speech. It supports the work of speech-language pathologists in clinical settings, as well as that of teachers in the context of second-language French instruction.
 
To read our full article: