Morphological dictionary and multi-word tokens

(First of all, congrats on UDPipe, it's a pleasure to use!)

I've built a morphological generator for an endangered language, and I'm having it save its output in the tab-separated format with the columns `FORM`, `LEMMA`, `UPOS`, `XPOS`, `FEATS`, so that I can also use it with UDPipe.

Is there any way of supporting multi-word tokens in that format, such that UDPipe will take them into account? For French, for example, I'd like a way to specify that _au_ should be split and tagged like so:

| ID | FORM | LEMMA | UPOS | XPOS | FEATS |
| --- | --- | --- | --- | --- | --- |
| 1-2 | au | _ | _ | _ | _ |
| 1 | à | à | ADP | ADP | _ |
| 2 | le | le | DET | DET | Definite=Def\|Gender=Masc\|Number=Sing\|PronType=Art |

If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?

I was thinking of something like the following:
```
au   _   _   _   _   SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art
```

It does seem awfully verbose though...
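
To make the idea concrete, here is a minimal Python sketch of how a line in this proposed format could be expanded into CoNLL-U-style rows. It is not something UDPipe does; `expand_entry` is a hypothetical helper, and the sketch assumes the six columns are separated by real tabs, that `/` separates the parts, and that `,` stands in for `|` inside `SplitFeats`. Since the proposal has no `SplitXPos` key, XPOS is left as `_`.

```python
def expand_entry(line, start_id=1):
    """Expand one dictionary line in the proposed format into CoNLL-U-style rows.

    Hypothetical helper: returns a list of (ID, FORM, LEMMA, UPOS, XPOS, FEATS) tuples.
    """
    # Assumption: six tab-separated columns, the last one holding the Split* keys.
    form, _lemma, _upos, _xpos, _feats, extra = line.rstrip("\n").split("\t")

    # Parse "Key=value|Key=value"; split on the first "=" only,
    # because SplitFeats values contain "=" themselves.
    attrs = dict(item.split("=", 1) for item in extra.split("|"))

    forms = attrs["SplitForm"].split("/")
    lemmas = attrs["SplitLemma"].split("/")
    upos_tags = attrs["SplitUPos"].split("/")
    # Inside SplitFeats, "," replaces the usual "|" feature separator.
    feats_list = [f.replace(",", "|") for f in attrs["SplitFeats"].split("/")]

    if not (len(forms) == len(lemmas) == len(upos_tags) == len(feats_list)):
        raise ValueError("Split* keys must all describe the same number of words")

    # First row is the multi-word token range, e.g. "1-2".
    end_id = start_id + len(forms) - 1
    rows = [(f"{start_id}-{end_id}", form, "_", "_", "_", "_")]

    # One row per syntactic word; XPOS stays "_" (no SplitXPos in the proposal).
    for offset, (f, l, u, ft) in enumerate(zip(forms, lemmas, upos_tags, feats_list)):
        rows.append((str(start_id + offset), f, l, u, "_", ft))
    return rows


if __name__ == "__main__":
    example = (
        "au\t_\t_\t_\t_\t"
        "SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|"
        "SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art"
    )
    for row in expand_entry(example):
        print("\t".join(row))
```

One caveat with this encoding: UD feature values can themselves contain commas (e.g. `Case=Acc,Dat`), and forms can contain `/`, so a real implementation would need different separators or an escaping rule.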
