The Floresta Sintá(c)tica project

Floresta Sintá(c)tica (syntactic forest) is a publicly available treebank for Portuguese, created as a collaboration between the VISL project, http://visl.sdu.dk, and Linguateca (formerly the Computational Processing of Portuguese project), http://www.linguateca.pt.

The Floresta is based on human revision of the output of the PALAVRAS parser, developed by Eckhard Bick for his PhD (1994-2000) at the University of Århus (Denmark). The parser is available on the Web at the VISL project site (http://visl.sdu.dk). More information about the parser can be found in: Bick, Eckhard. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press, 2000.

The Floresta Sintá(c)tica project comprises three corpora:

These three corpora can be queried through Milhafre, a user-friendly search interface.

Documentation

Please see the paper at LREC'2002 for a general description of the project. All information in English so far is listed below:

Team

Project leaders: Diana Santos (to September 2007) and Eckhard Bick.

Linguistic revision
Susana Afonso (November 2000 to 2005)
Raquel Marchi (November 2000 to September 2001; Jan 2003 to 2005)
Anabela Barreiro Colasuonno (May-December 2002)
Cláudia Freitas (June 2007 to present)

Tool development
Renato Haber (November 2000 to September 2001)
Luís Sarmento (November-December 2002)
Rui Vilela (August 2004 to December 2005)
Paulo Rocha (June 2007 to present)

Results

The Floresta Sintá(c)tica project has so far produced:

Each tree of our treebank corresponds to three different objects:

  1. CG representation in text format
  2. Phrase tree in text format
  3. Phrase tree in graphical format
Objects 2 and 3 contain exactly the same information and differ only in presentation, while object 1 contains neither constituent nor attachment information (only dependency). We have some example sentences to illustrate the three objects.

Access

Bosque

One can download the phrase trees that constitute the Bosque (v7.6), in several formats, from the main page:

They can also be individually inspected in graphical format at the VISL site, Portuguese zone, choosing, under "Non-automatic parse", "Floresta sintá(c)tica treebank", http://visl.sdu.dk/visl/pt/floresta.html?S=cetemcorpus#top.

They can also be reached at the VISL site, Portuguese zone, by choosing, under "Non-automatic parse", "Pre-analysed Portuguese sentences", then "Newspaper corpus treebank (Floresta)" (http://visl.sdu.dk/visl/pt/treebank.html), and clicking on the tree icon preceding each sentence.

The Bosque is also available in the Penn Treebank and TIGER formats, in XML, through the work of the Braga node of Linguateca, see Floresta page at Braga.

Bosque 7.3 was used for the CoNLL-X shared task on multilingual dependency parsing. We are grateful to Sabine Buchholz for processing Bosque and making it available for the CoNLL-X exercise. The data provided here were prepared by her and her team; we simply make them available as is.
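
For readers who want to work with the CoNLL-X version, the following is a minimal Python sketch of how such a file might be read. It relies only on the general CoNLL-X layout (one token per line, ten tab-separated columns, a blank line between sentences). The function names, the sample sentence and its tags and dependency labels are invented for illustration and are not taken from Bosque.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# The ten tab-separated columns of the CoNLL-X format.
CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]

Token = Dict[str, str]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, values)))
    if sentence:
        # Last sentence when the input does not end with a blank line.
        yield sentence


def dependency_arcs(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head form, relation, dependent form) triples.

    Head id "0" denotes the artificial root of the sentence.
    """
    forms = {tok["id"]: tok["form"] for tok in sentence}
    forms["0"] = "ROOT"
    return [(forms[tok["head"]], tok["deprel"], tok["form"]) for tok in sentence]


# Invented sample: the tags and labels below are placeholders, not Bosque's.
SAMPLE = (
    "1\tO\to\tart\tart\t_\t2\tDET\t_\t_\n"
    "2\tgato\tgato\tn\tn\t_\t3\tSUBJ\t_\t_\n"
    "3\tdorme\tdormir\tv\tv\t_\t0\tROOT\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllx(SAMPLE.splitlines()):
        for head, rel, dep in dependency_arcs(sent):
            print(f"{head} --{rel}--> {dep}")
```

To read a downloaded file instead of the sample, an open file handle can be passed directly to read_conllx, since it accepts any iterable of lines.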

Finally, they can be queried through

Floresta Virgem

Most of Floresta Virgem can also be queried through Milhafre and Águia:

Last update: 8 July 2008