Morphological dictionary and multi-word tokens #99

@jeanm

(First of all, congrats on UDPipe, it's a pleasure to use!)

I've built a morphological generator for an endangered language, and I'm having it save its output in the tab-separated FORM, LEMMA, UPOS, XPOS, FEATS format so that I can also use it with UDPipe.

Is there any way to represent multi-word tokens in that format so that UDPipe takes them into account? For French, for example, I would like to specify that au should be split and tagged like so:

1-2 au _ _ _ _
1 à à ADP ADP _
2 le le DET DET Definite=Def|Gender=Masc|Number=Sing|PronType=Art

If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?

I was thinking of something like the following:

au   _   _   _   _   SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art

It does seem awfully verbose though...
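
To make the proposal concrete, here is a minimal Python sketch of how such an entry could be written and then expanded back into the multi-word token lines shown earlier. Everything in it is an assumption on my side and not something UDPipe supports: the function names (`encode_split`, `decode_split`, `expand_entry`) are hypothetical, `/` separates the sub-tokens, and `,` stands in for `|` inside each sub-token's features. Since the proposal has no SplitXPos key, XPOS is emitted as `_`. Only the six columns from the example above are produced, not full CoNLL-U lines.

```python
def encode_split(parts):
    """Build the proposed FEATS value from (form, lemma, upos, feats) tuples.

    feats is a '|'-joined feature string, or '_' when there are no features.
    """
    forms, lemmas, upos, feats = zip(*parts)
    # '|' already separates the outer key=value pairs, so swap it for ','
    # inside each sub-token's feature string.
    inner_feats = [f.replace("|", ",") for f in feats]
    return "|".join([
        "SplitForm=" + "/".join(forms),
        "SplitLemma=" + "/".join(lemmas),
        "SplitUPos=" + "/".join(upos),
        "SplitFeats=" + "/".join(inner_feats),
    ])


def decode_split(feats_column):
    """Recover the (form, lemma, upos, feats) tuples from the proposed FEATS value."""
    fields = dict(item.split("=", 1) for item in feats_column.split("|"))
    forms = fields["SplitForm"].split("/")
    lemmas = fields["SplitLemma"].split("/")
    upos = fields["SplitUPos"].split("/")
    # Undo the ','-for-'|' substitution made by encode_split.
    feats = [f.replace(",", "|") for f in fields["SplitFeats"].split("/")]
    if not len(forms) == len(lemmas) == len(upos) == len(feats):
        raise ValueError("Split* fields have different numbers of parts")
    return list(zip(forms, lemmas, upos, feats))


def expand_entry(surface, feats_column, start_id=1):
    """Turn one dictionary entry into a token range line plus one line per word."""
    parts = decode_split(feats_column)
    end_id = start_id + len(parts) - 1
    lines = ["\t".join([f"{start_id}-{end_id}", surface, "_", "_", "_", "_"])]
    for offset, (form, lemma, upos, feats) in enumerate(parts):
        # XPOS is '_' because the proposed encoding carries no SplitXPos key.
        lines.append("\t".join([str(start_id + offset), form, lemma, upos, "_", feats]))
    return lines


entry = encode_split([
    ("à", "à", "ADP", "_"),
    ("le", "le", "DET", "Definite=Def|Gender=Masc|Number=Sing|PronType=Art"),
])
for line in expand_entry("au", entry):
    print(line)
```

One weakness this makes visible: the encoding breaks if a form or lemma contains `/`, or if a feature value itself contains a comma (UD allows multi-valued features such as `Gender=Fem,Masc`), so a real version would need an escaping scheme or different separators.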
