tokenize by script boundaries - only #327

@mediabuff

I am trying to tokenize multilingual (or rather, multi-script) strings into components, where each component contains only one script (as defined by Unicode). I tried using -segment_alphabet_change, but this also breaks at spaces.
For example, the following string

the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति

should break into 4 tokens:

"the root" "कृ " "in the sense of frequency; e.g." "चर्करीति, चर्कर्ति, बोभवीति बोभोति"
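
To make the requested behavior concrete, here is a minimal sketch in plain Python of splitting only at script changes. It is not the tokenizer's implementation or API: the function names script_of and split_by_script are hypothetical, and the script is approximated by the first word of each character's Unicode name (e.g. LATIN, DEVANAGARI), because the standard library does not expose the Unicode Script property. Spaces, punctuation and digits are treated as script-neutral and stay attached to the run they follow.

```python
import unicodedata


def script_of(ch):
    """Approximate a character's script (hypothetical helper).

    Assumption: the first word of the Unicode character name stands in for
    the Script property. Only letters (L*) and marks (M*) carry a script;
    everything else (spaces, punctuation, digits) is treated as neutral.
    """
    if unicodedata.category(ch)[0] not in ("L", "M"):
        return None
    name = unicodedata.name(ch, "")
    if not name:
        return None
    first_word = name.split(" ", 1)[0]
    # Generic combining marks (e.g. COMBINING ACUTE ACCENT) inherit the
    # script of their base character, so treat them as neutral too.
    if first_word == "COMBINING":
        return None
    return first_word


def split_by_script(text):
    """Split text into runs, breaking only where the script changes."""
    runs = []
    current = []
    current_script = None
    for ch in text:
        script = script_of(ch)
        # Start a new run only when a script-bearing character disagrees
        # with the script of the run being built.
        if script is not None and current_script is not None and script != current_script:
            runs.append("".join(current))
            current = []
        if script is not None:
            current_script = script
        current.append(ch)
    if current:
        runs.append("".join(current))
    return runs


if __name__ == "__main__":
    sample = "the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति"
    for run in split_by_script(sample):
        print(repr(run))
```

Because neutral characters stay with the preceding run, a run can end in a trailing space; strip each run if that matters. The name-based heuristic is only an approximation, so a library that exposes the real Script property (for example the third-party regex module's \p{Script=...} classes, or ICU) would be a more reliable basis.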
