Workshop on the Annotation of Multilingual Corpora

3 October 2011

Workshop on the annotation of multilingual corpora, co-organized by the two linguistics federations: Isabelle Leglise (Fédération Typologie et Universaux linguistiques) and Lorenza Mondada (Institut de Linguistique Française). Paris, 3 October 2011.

Multilingual corpora concentrate most of the problems raised by monolingual corpora and add further challenges of their own.

The central issues concern variation and non-standard forms, which large national corpora often ignore or control through rather general criteria (such as broad typologies of texts and discourses that define the kind of data collected). This variation goes beyond the internal variation observable within a single language; it often calls into question the very categorization of linguistic forms as belonging to one language or variety rather than another. Corpora containing code-switching, code-mixing, hybridization phenomena, or heterogeneous uses of a lingua franca, variably mobilized within diverse social practices and linguistic competences, raise a range of methodological and theoretical questions, such as the identification, notation, transcription, and categorization of hybrid forms. These problems have crucial consequences for the annotation of corpora, for the definition and delimitation of what a multilingual corpus is, for the choice of relevant contexts of practice to be documented, etc.

The workshop aims to discuss these problems on the basis of:

- a) existing databases of multilingual corpora, for which examples of the problems and solutions will be given;
- b) excerpts of multilingual corpora, on which problems of transcription, annotation, and exploitation will be illustrated and discussed.