French Treebank

Documentation

Formats

The corpus is available in four different versions:

File type: Text*
Workstation: PC, Unix, MacOS
Format in use: XML

Annotation choices

To understand the annotation choices, please read the documentation:

Morphosyntactic annotation

We define a complete morphosyntactic tag as follows:

For part of speech, we made traditional choices, with two exceptions: weak pronouns are given a POS of their own (clitic), following the generative tradition, and foreign words (in quotations) receive a special ET tag. Punctuation marks are divided into strong (clause markers) and weak (all the others). Most typographical signs (including %, numbers and abbreviations) are assigned a traditional POS (usually common noun).

We distinguish 15 lexical categories, used for simple words as well as for compounds:

Constituent annotation

We have chosen surface and shallow annotations, compatible with various syntactic frameworks.

Our phrasal tagset is as follows:

We chose to only annotate major phrases with little internal structure. For the sake of simplicity, we make parsimonious use of unary phrases. For rigid sequences of categories, such as dates or addresses, it is difficult to determine the head, and we have one global NP with no internal constituents.

We annotate certain phrases with a subcategory, which is important for functional annotation, for example relative or subordinate for embedded clauses.

We do not have discontinuous constituents.

In order to be as theory-neutral as possible, we use neither empty categories nor functional phrases (no DP or CP). We allow for headless phrases (elliptical NPs lacking a head noun, or sentential clauses lacking a verbal nucleus).

Unexpressed subjects (in infinitives or participials) will be marked at the functional level.

For verbal phrases, we only annotate the minimal verbal nucleus (clitics, auxiliaries, negation and verb), because the traditional VP (with complements) is subject to much linguistic debate and is often discontinuous in French.
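As a rough illustration of the minimal verbal nucleus, the sketch below parses a small hand-written XML fragment with Python's standard library. The element names (`SENT`, `VN`, `NP`, `w`), the `cat` attribute and its values are assumptions made for this example only; they are not taken from the corpus schema, and the sentence is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for "Paul ne lui a pas donné le livre".
# Element and attribute names are illustrative assumptions.
FRAGMENT = """
<SENT>
  <NP><w cat="N">Paul</w></NP>
  <VN>
    <w cat="ADV">ne</w>
    <w cat="CL">lui</w>
    <w cat="V">a</w>
    <w cat="ADV">pas</w>
    <w cat="V">donné</w>
  </VN>
  <NP><w cat="D">le</w><w cat="N">livre</w></NP>
</SENT>
"""


def verbal_nucleus_words(sentence):
    """Return the words inside the first VN element, in order."""
    vn = sentence.find("VN")
    return [w.text for w in vn.iter("w")]


def phrases_outside_nucleus(sentence):
    """Return (label, text) for every top-level phrase that is not the VN."""
    result = []
    for child in sentence:
        if child.tag != "VN":
            text = " ".join(w.text for w in child.iter("w"))
            result.append((child.tag, text))
    return result


root = ET.fromstring(FRAGMENT)
print(verbal_nucleus_words(root))
print(phrases_outside_nucleus(root))
```

In this sketch the negation, the clitic, the auxiliary and the participle sit inside the nucleus, while the complement NP remains a sibling of the nucleus rather than being embedded in a larger VP.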

For coordination, we only mark a coordinating phrase after a coordinating conjunction. We do not necessarily embed conjuncts inside a phrase, since there are cases where there is none, and there are cases where the category of the phrase would be unspecified.

Function annotation

We have chosen to annotate the grammatical functions of major constituents that depend on a verb (or VN).

Our functional tagset is as follows:

No more than one function can be tagged on a constituent, except for verbal nuclei, which bear all the functions of their pronominal clitics.

Only surface functions are encoded: we code the subject of a passive as a subject, and the postverbal NP in an impersonal construction (Il est venu 3 hommes) as an object.
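The impersonal example can be sketched in the same spirit. The fragment below is hand-written for illustration: the `fct` attribute name and the `SUJ` and `OBJ` labels are assumptions of this sketch, not values confirmed by the tagset.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of "Il est venu 3 hommes".
# The attribute name "fct" and the labels SUJ / OBJ are assumed.
IMPERSONAL = """
<SENT>
  <NP fct="SUJ"><w>Il</w></NP>
  <VN><w>est</w><w>venu</w></VN>
  <NP fct="OBJ"><w>3</w><w>hommes</w></NP>
</SENT>
"""


def surface_functions(xml_text):
    """List (function, text) pairs for top-level constituents that carry a function."""
    sentence = ET.fromstring(xml_text)
    pairs = []
    for child in sentence:
        fct = child.get("fct")
        if fct is not None:
            text = " ".join(w.text for w in child.iter("w"))
            pairs.append((fct, text))
    return pairs


print(surface_functions(IMPERSONAL))
```

Under these assumptions the expletive "Il" is the surface subject and the postverbal NP is coded as an object, even though it is the logical subject of the verb.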

We do not code the fact that a subject or an object of a given verb can also be the subject of an embedded Vinf, for example.

Parentheticals usually have the function MOD.

Embedded phrases such as Srel do not have a function, unless they are extraposed or clefted (in which case they have the function MOD). COORD phrases do not have a function, except in the case of multiple coordinations, where each COORD has the same function (Ni Paul ni Marie ne viendra).

Within the same clause, several constituents can have the same function.

We do not code the link between the dependent and the head, so long distance dependencies are not taken into account.

Examples

Version history

Note that the 1.0 release is the first full release, that is, the first release in which all morphosyntactic tags are available for all sentences:

Before the 1.0 release, several beta versions were released, in which only a subset of the sentences contained grammatical functions.

We list a few of these versions: