Per-Fide!: Project Details

Portuguese in parallel with six languages: Spanish, Russian, French, Italian, German, English

This project is a joint venture between the Computer Science Department and the School of Humanities of the University of Minho. It is intended to establish a research environment for translation and literary studies, lexicography, contrastive linguistics, and language teaching, bringing together researchers from different language departments. Our main objective is to continue the work developed at Linguateca, the Language Resource Center for Portuguese, whose mission was to contribute to the processing of the Portuguese language. We count on the expertise of the researchers from the Braga node of this center.

We aim to build parallel corpora that relate Portuguese to the languages taught at the School of Humanities. These corpora will contain various language combinations in which Portuguese (in its different varieties: European, Brazilian and African Portuguese) takes part, either as source language or as target language. The other languages to be included are Spanish, Russian, French, Italian, German and English (Pt, Es, Ru, Fr, It, De, En: Per-Fide). We will give special attention to the language pairs that do not have large corpora available. For instance, the Portuguese-English pair is already well served for legislative and literary texts (EuroParl and JRC-Acquis for the legislative, COMPARA for the literary). Other pairs, like Portuguese-Russian, have no parallel corpora available at all.

The developed corpora will contain original texts in the seven languages and their translations into as many of the other six languages as possible. Whenever possible, we will try to produce parallel corpora with more than two languages. The corpora will be divided into two main genres: contemporary fiction and non-fiction. In the non-fiction category we are considering religious texts (mainly Encyclicals, Letters and Angelus from the Vatican website); journalistic articles (Le Monde Diplomatique); legal texts (European Community Law and international agreements); and technical texts (instruction and operating manuals, norms, standards and directives, and specialised documentation in the fields of automotive industry, electronics, telecommunications, computer science, standardization, pharmaceutical industry and medicine/health sciences). The fiction category will include contemporary novels and short stories. We would like to focus primarily on Portuguese authors.

Regarding copyright, we already have clearance to use the Vatican texts and the Portuguese and French versions of Le Monde Diplomatique (other languages are being negotiated). The Portuguese publishing house Caminho is willing to provide some literary books for inclusion in our corpus.

The proposed team has experience in corpus construction: a French-Portuguese parallel corpus based on texts from Le Monde Diplomatique [Cor06] was compiled in the scope of a Ph.D. thesis [Ara08]; a Portuguese-German parallel corpus built from José Saramago's “Ensaio sobre a Cegueira” was compiled in the scope of a master's thesis [Dia02]; and a Portuguese-Spanish parallel corpus based on the dialogues of Almodóvar's film “Todo sobre mi madre” was compiled in the scope of another master's thesis [San07]. Also worth mentioning is UMPessoa Paralelo, an initiative to create a parallel corpus based on the works of Fernando Pessoa, built upon the available French, English and Spanish translations of the book “O Livro do Desassossego”.

A set of monolingual corpora, freely available on the Internet, was developed in the context of the Linguateca project. Other projects, including funded ones, have been developing corpora without making them available. We argue that corpora should not be made available only through an online concordance system but, whenever possible, also released in textual format for download and local processing by any natural language researcher. Thus, the corpora will be available for download in both the Text Encoding Initiative (TEI) and the XML Corpus Encoding Standard (XCES) formats.
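
As an illustration of what a downloadable, locally processable release could look like, the following minimal sketch serialises sentence-aligned pairs into a TEI-style XML fragment. It is only an assumption of ours: the element names (`s`, `linkGrp`, `link`) follow TEI conventions, but the structure is simplified (no header, no namespace) and has not been validated against the TEI or XCES schemas. The function name `aligned_pairs_to_xml` and the identifier scheme are hypothetical.

```python
import xml.etree.ElementTree as ET


def aligned_pairs_to_xml(pairs, src_lang="pt", tgt_lang="en"):
    """Serialise aligned sentence pairs as a simplified TEI-style fragment.

    Hypothetical helper for illustration only; the output is not
    schema-valid TEI or XCES.
    """
    root = ET.Element("TEI")
    body = ET.SubElement(ET.SubElement(root, "text"), "body")
    # One division per language, holding the sentences of that language.
    src_div = ET.SubElement(body, "div", {"xml:lang": src_lang})
    tgt_div = ET.SubElement(body, "div", {"xml:lang": tgt_lang})
    # Alignment links point at the sentence identifiers of both sides.
    link_grp = ET.SubElement(body, "linkGrp", {"type": "alignment"})
    for i, (src, tgt) in enumerate(pairs, start=1):
        src_id = f"{src_lang}.{i}"
        tgt_id = f"{tgt_lang}.{i}"
        ET.SubElement(src_div, "s", {"xml:id": src_id}).text = src
        ET.SubElement(tgt_div, "s", {"xml:id": tgt_id}).text = tgt
        ET.SubElement(link_grp, "link", {"target": f"#{src_id} #{tgt_id}"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(aligned_pairs_to_xml([("A casa é azul.", "The house is blue.")]))
```

Keeping the alignment links separate from the sentences means the same monolingual text can take part in several language pairs, which matters for corpora with more than two languages.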

Corpora are even more useful when annotated. We intend to use freely available morpho-syntactic taggers to add disambiguated morphological information to each word, in all languages. Unfortunately, free taggers are not available for some languages. We intend to contact Syddansk Universitet to acquire their tagger, PALAVRAS, known to be the most robust tagger for some European languages.

Some of the team's researchers have a solid background in extracting bilingual resources from parallel corpora, ranging from probabilistic translation dictionaries to bilingual terminology and translation examples [SA06a]. Building on this knowledge, another objective of this project is to compute these kinds of resources and make them available for download or query on the Internet.
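
To give an idea of what a probabilistic translation dictionary is, the sketch below estimates word translation probabilities from sentence-aligned text using the classic IBM Model 1 expectation-maximisation procedure. This is a generic textbook technique chosen for illustration, not necessarily the method used in [SA06a]; the function name, the toy sentences and the omission of the usual NULL word are our own assumptions.

```python
from collections import defaultdict


def translation_dictionary(pairs, iterations=10):
    """Estimate P(target word | source word) with IBM Model 1 (no NULL word).

    `pairs` is a list of (source_tokens, target_tokens) tuples.
    Returns {source_word: {target_word: probability}}.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform distribution over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for tw in tgt:
                # Normalise over all source words that could explain tw.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    delta = t[(tw, sw)] / norm
                    count[(tw, sw)] += delta
                    total[sw] += delta
        # Re-estimate the probabilities from the expected counts.
        t = {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
    dictionary = defaultdict(dict)
    for (tw, sw), p in t.items():
        dictionary[sw][tw] = p
    return dict(dictionary)


if __name__ == "__main__":
    toy = [
        ("a casa".split(), "the house".split()),
        ("a flor".split(), "the flower".split()),
        ("uma flor".split(), "a flower".split()),
    ]
    for target, prob in sorted(translation_dictionary(toy)["casa"].items()):
        print(target, round(prob, 3))
```

Each entry maps a source word to a probability distribution over candidate translations, so the dictionary can be queried for the most likely translation or filtered by a probability threshold.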