(First of all, congrats on UDPipe, it's a pleasure to use!)
I've built a morphological generator for an endangered language, and I'm having it save its output in the tab-separated FORM,LEMMA,UPOS,XPOS,FEATS format so that I can also use it with UDPipe.
Is there any way of supporting multi-word tokens in that format, such that UDPipe will take them into account? For example, for French, I am looking for a way to specify that au should be split and tagged like so:
| ID | FORM | LEMMA | UPOS | XPOS | FEATS |
|---|---|---|---|---|---|
| 1-2 | au | _ | _ | _ | _ |
| 1 | à | à | ADP | ADP | _ |
| 2 | le | le | DET | DET | Definite=Def\|Gender=Masc\|Number=Sing\|PronType=Art |
If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?
I was thinking of something like the following:
au _ _ _ _ SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art
It does seem awfully verbose though...
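
To make the proposal concrete, here is a rough Python sketch of how a consumer could expand such a line back into a multi-word token range plus its syntactic words. It is only an illustration of my idea, not anything UDPipe does: the function name `expand_split_token` and the `start` parameter are made up, I am assuming the `Split*` attributes live in an extra sixth tab-separated column, and a `SplitXPos` key (not in my example above) is treated as optional.

```python
def expand_split_token(line, start=1):
    """Expand one tab-separated line into (ID, FORM, LEMMA, UPOS, XPOS, FEATS) rows.

    Hypothetical helper: assumes FORM, LEMMA, UPOS, XPOS, FEATS in the first
    five columns and the proposed Split* attributes in an optional sixth one.
    """
    fields = line.rstrip("\n").split("\t")
    form, lemma, upos, xpos, feats = fields[:5]
    extra = fields[5] if len(fields) > 5 else "_"

    # Parse "Key=value|Key=value" into a dict; split on the first "=" only,
    # because SplitFeats values contain "=" themselves.
    attrs = {}
    if extra != "_":
        attrs = dict(item.split("=", 1) for item in extra.split("|"))

    # Ordinary token: emit it unchanged with a plain ID.
    if "SplitForm" not in attrs:
        return [(str(start), form, lemma, upos, xpos, feats)]

    forms = attrs["SplitForm"].split("/")
    n = len(forms)

    def parts(key):
        # Missing keys default to "_" for every word of the split.
        values = attrs[key].split("/") if key in attrs else ["_"] * n
        if len(values) != n:
            raise ValueError(f"{key} has {len(values)} parts, expected {n}")
        return values

    lemmas = parts("SplitLemma")
    uposes = parts("SplitUPos")
    xposes = parts("SplitXPos")
    # Inside SplitFeats, "," stands in for the usual "|" feature separator.
    featses = [f.replace(",", "|") for f in parts("SplitFeats")]

    # First the range line for the surface token, then one row per word.
    rows = [(f"{start}-{start + n - 1}", form, "_", "_", "_", "_")]
    for i in range(n):
        rows.append((str(start + i), forms[i], lemmas[i], uposes[i],
                     xposes[i], featses[i]))
    return rows


if __name__ == "__main__":
    example = ("au\t_\t_\t_\t_\t"
               "SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|"
               "SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art")
    for row in expand_split_token(example):
        print("\t".join(row))
```

One weakness this makes visible: using `,` as the inner feature separator would clash with UD features that carry several comma-separated values (e.g. `Case=Acc,Dat`), and `/` could not appear inside any form or lemma, so different separators might be needed.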