Merged
142 changes: 15 additions & 127 deletions packages/generaltranslation-icu-messageformat-parser/README.md
Original file line number Diff line number Diff line change
@@ -1,146 +1,34 @@
<p align="center">
<a href="https://generaltranslation.com/docs/python">
<picture>
<source media="(prefers-color-scheme: light)" srcset="https://generaltranslation.com/brand/gt-logo-light.svg">
<img alt="General Translation" src="https://generaltranslation.com/brand/gt-logo-dark.svg" width="100" height="100">
</picture>
</a>
</p>

<p align="center">
<a href="https://generaltranslation.com/docs/python"><strong>Documentation</strong></a> · <a href="https://github.com/generaltranslation/gt-python/issues">Report Bug</a>
</p>

# generaltranslation-icu-messageformat-parser

> ⚠️ **Experimental / Unstable** — This package is under active development and may be subject to breaking changes.

A pure-Python ICU MessageFormat parser with whitespace-preserving AST and string reconstruction. Python equivalent of [`@formatjs/icu-messageformat-parser`](https://www.npmjs.com/package/@formatjs/icu-messageformat-parser).

Derived from [pyicumessageformat](https://github.com/SirStendec/pyicumessageformat) by Mike deBeaubien (MIT license).
A pure-Python ICU MessageFormat parser. Python equivalent of [`@formatjs/icu-messageformat-parser`](https://www.npmjs.com/package/@formatjs/icu-messageformat-parser).

## Installation

```bash
pip install generaltranslation-icu-messageformat-parser
```

No dependencies. Pure Python. Requires Python 3.10+.

## Quick Start

```python
from generaltranslation_icu_messageformat_parser import Parser, print_ast

parser = Parser()
ast = parser.parse("{count, plural, one {# item} other {# items}}")
# [{'name': 'count', 'type': 'plural', 'offset': 0, 'options': {'one': [{'type': 'number', 'name': 'count', 'hash': True}, ' item'], 'other': [{'type': 'number', 'name': 'count', 'hash': True}, ' items']}}]
```

## API

### `Parser(options=None)`

Create a parser instance with optional configuration.

**Options dict keys:**

| Option | Type | Default | Description |
|---|---|---|---|
| `subnumeric_types` | `list[str]` | `['plural', 'selectordinal']` | Types that support `#` hash replacement |
| `submessage_types` | `list[str]` | `['plural', 'selectordinal', 'select']` | Types with sub-message branches |
| `maximum_depth` | `int` | `50` | Maximum nesting depth |
| `allow_tags` | `bool` | `False` | Enable XML-style `<tag>` parsing |
| `strict_tags` | `bool` | `False` | Strict tag parsing mode |
| `tag_prefix` | `str \| None` | `None` | Required tag name prefix |
| `tag_type` | `str` | `'tag'` | AST node type string for tags |
| `include_indices` | `bool` | `False` | Include `start`/`end` positions in AST nodes |
| `loose_submessages` | `bool` | `False` | Allow loose submessage parsing |
| `allow_format_spaces` | `bool` | `True` | Allow spaces in format strings |
| `require_other` | `bool` | `True` | Require `other` branch in plural/select |
| `preserve_whitespace` | `bool` | `False` | Store whitespace in `_ws` dict on AST nodes for lossless round-trips |

### `Parser.parse(input, tokens=None)`

Parse an ICU MessageFormat string into an AST.

**Args:**
- `input` (`str`): The ICU MessageFormat string to parse.
- `tokens` (`list | None`): Optional list to populate with token objects for low-level analysis.

**Returns:** `list` — A list of AST nodes (strings and dicts).

**Raises:** `SyntaxError` on malformed input, `TypeError` if input is not a string.

### `print_ast(ast)`

Reconstruct an ICU MessageFormat string from an AST.

**Args:**
- `ast` (`list`): The AST as returned by `Parser.parse()`.

**Returns:** `str` — The reconstructed ICU MessageFormat string.

When the AST contains `_ws` whitespace metadata (from `preserve_whitespace=True`), reconstruction is lossless — the output exactly matches the original input. Without whitespace metadata, normalized spacing is used.
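
To make the normalized (non-lossless) path concrete, here is a minimal sketch of a printer over the node shapes documented in this README. It is not the package's `print_ast`: `render_ast` is a hypothetical name, it ignores `_ws` metadata and tags, and it does not re-escape quotes or braces.

```python
def render_ast(ast):
    """Rebuild an ICU MessageFormat string from a list of AST nodes (normalized spacing)."""
    parts = []
    for node in ast:
        if isinstance(node, str):
            parts.append(node)  # literal text
        elif node.get("hash"):
            parts.append("#")  # hash placeholder inside plural/selectordinal
        elif "options" in node:
            # plural, selectordinal, or select: render each branch recursively
            head = f"{node['name']}, {node['type']},"
            if node.get("offset"):
                head += f" offset:{node['offset']}"
            branches = " ".join(
                f"{key} {{{render_ast(body)}}}" for key, body in node["options"].items()
            )
            parts.append(f"{{{head} {branches}}}")
        elif "type" in node:
            # typed placeholder with an optional format/style
            fmt = f", {node['format']}" if "format" in node else ""
            parts.append(f"{{{node['name']}, {node['type']}{fmt}}}")
        else:
            parts.append(f"{{{node['name']}}}")  # simple variable
    return "".join(parts)


example = [
    {
        "name": "count",
        "type": "plural",
        "offset": 0,
        "options": {
            "one": [{"type": "number", "name": "count", "hash": True}, " item"],
            "other": [{"type": "number", "name": "count", "hash": True}, " items"],
        },
    }
]
print(render_ast(example))  # {count, plural, one {# item} other {# items}}
```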

## AST Node Types

### String literal
Plain strings appear directly in the AST list:
```python
ast = parser.parse("Hello world")
# ["Hello world"]
print(print_ast(ast))  # "Hello world"
```

### Simple variable `{name}`
```python
{"name": "username"}
```

### Typed placeholder `{name, type, style}`
```python
{"name": "amount", "type": "number", "format": "::currency/USD"}
```

### Plural / selectordinal `{n, plural, ...}`
```python
{
"name": "count",
"type": "plural", # or "selectordinal"
"offset": 0, # offset value (0 if none)
"options": {
"one": [{"type": "number", "name": "count", "hash": True}, " item"],
"other": [{"type": "number", "name": "count", "hash": True}, " items"],
"=0": ["no items"], # exact match keys
}
}
```

### Select `{gender, select, ...}`
```python
{
"name": "gender",
"type": "select",
"options": {
"male": ["He"],
"female": ["She"],
"other": ["They"],
}
}
```

### Hash `#` (inside plural/selectordinal)
```python
{"type": "number", "name": "count", "hash": True}
```
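
Because the AST is plain lists, strings, and dicts, it can be walked with ordinary Python. The sketch below collects every variable name a message references, using only the node shapes shown above. `collect_variable_names` is a hypothetical helper, not part of the package, and it assumes tag nodes (from `allow_tags`) are not present.

```python
def collect_variable_names(ast):
    """Return the set of variable names referenced anywhere in an AST."""
    names = set()
    for node in ast:
        if isinstance(node, str):
            continue  # plain text carries no variables
        names.add(node["name"])
        # plural, selectordinal, and select nodes hold a nested AST per branch
        for branch in node.get("options", {}).values():
            names |= collect_variable_names(branch)
    return names


# AST literal assembled from the node shapes shown above
ast = [
    "Hi ",
    {"name": "username"},
    ", ",
    {
        "name": "count",
        "type": "plural",
        "offset": 0,
        "options": {
            "one": [{"type": "number", "name": "count", "hash": True}, " item"],
            "other": [{"type": "number", "name": "count", "hash": True}, " items"],
        },
    },
]
print(sorted(collect_variable_names(ast)))  # ['count', 'username']
```

A check like this could be used to verify that a translated message references the same variables as its source.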

### With `include_indices=True`
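All dict nodes gain `start` and `end` integer fields giving the node's position in the original string. Because the parser works on Python `str` values, these are string indices (character offsets) rather than byte offsets.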

### With `preserve_whitespace=True`
Dict nodes gain a `_ws` dict storing whitespace at each structural position, enabling lossless `print_ast()` round-trips.

## Supported ICU Features

- Simple variable interpolation: `{name}`
- Plural with CLDR categories: `{n, plural, one {...} other {...}}`
- Exact match: `{n, plural, =0 {...} =1 {...} other {...}}`
- Plural offset: `{n, plural, offset:1 ...}`
- Selectordinal: `{n, selectordinal, one {#st} two {#nd} few {#rd} other {#th}}`
- Select: `{gender, select, male {...} female {...} other {...}}`
- Nested expressions: plural inside select, select inside plural, etc.
- Typed placeholders: `{amount, number}`, `{d, date, short}`
- ICU escape sequences: `''` for literal quote, `'{...}'` for literal braces
- Hash `#` replacement inside plural/selectordinal branches
- XML-style tags (opt-in): `<bold>text</bold>`

## Known Limitations

- **Escape sequences are consumed during parsing.** `''` becomes `'` and `'{...}'` becomes `{...}` in the AST. These cannot be reconstructed by `print_ast()`. This matches the behavior of `@formatjs/icu-messageformat-parser`.
167 changes: 13 additions & 154 deletions packages/generaltranslation-intl-messageformat/README.md
Original file line number Diff line number Diff line change
@@ -1,175 +1,34 @@
<p align="center">
<a href="https://generaltranslation.com/docs/python">
<picture>
<source media="(prefers-color-scheme: light)" srcset="https://generaltranslation.com/brand/gt-logo-light.svg">
<img alt="General Translation" src="https://generaltranslation.com/brand/gt-logo-dark.svg" width="100" height="100">
</picture>
</a>
</p>

<p align="center">
<a href="https://generaltranslation.com/docs/python"><strong>Documentation</strong></a> · <a href="https://github.com/generaltranslation/gt-python/issues">Report Bug</a>
</p>

# generaltranslation-intl-messageformat

> ⚠️ **Experimental / Unstable** — This package is under active development and may be subject to breaking changes.

ICU MessageFormat formatter with locale-aware plural and select rules. Python equivalent of [`intl-messageformat`](https://www.npmjs.com/package/intl-messageformat).

Uses [`generaltranslation-icu-messageformat-parser`](../generaltranslation-icu-messageformat-parser) for parsing and [Babel](https://babel.pocoo.org/) for CLDR plural rules.

## Installation

```bash
pip install generaltranslation-intl-messageformat
```

Dependencies: `generaltranslation-icu-messageformat-parser`, `babel>=2.18.0`. Pure Python, no C extensions.

## Quick Start

```python
from generaltranslation_intl_messageformat import IntlMessageFormat

# Simple variable interpolation
mf = IntlMessageFormat("Hello, {name}!", "en")
mf.format({"name": "World"}) # "Hello, World!"

# Plural with CLDR rules
mf = IntlMessageFormat("{count, plural, one {# item} other {# items}}", "en")
mf.format({"count": 1}) # "1 item"
mf.format({"count": 5}) # "5 items"
mf.format({"count": 1000}) # "1,000 items"

# Select
mf = IntlMessageFormat("{gender, select, male {He} female {She} other {They}} left.", "en")
mf.format({"gender": "female"}) # "She left."

# Selectordinal
mf = IntlMessageFormat("{n, selectordinal, one {#st} two {#nd} few {#rd} other {#th}}", "en")
mf.format({"n": 1}) # "1st"
mf.format({"n": 22}) # "22nd"
mf.format({"n": 3}) # "3rd"
mf.format({"n": 4}) # "4th"
```

## API

### `IntlMessageFormat(pattern, locale="en")`

Create a message formatter.

**Args:**
- `pattern` (`str`): An ICU MessageFormat pattern string.
- `locale` (`str`): A BCP 47 locale tag. Defaults to `"en"`. Falls back to `"en"` if the locale is invalid.

### `IntlMessageFormat.format(values=None)`

Format the message with variable values.

**Args:**
- `values` (`dict | None`): A dict mapping variable names to values. Values can be `str`, `int`, `float`, or any type convertible via `str()`. Missing variables resolve to an empty string.

**Returns:** `str` — The formatted message.
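
The value handling described above (conversion via `str()`, missing variables becoming an empty string) can be illustrated with a small stand-alone sketch. It covers only literal text and simple `{name}` nodes, and `interpolate_simple` is a hypothetical name rather than anything the package exposes.

```python
def interpolate_simple(ast, values=None):
    """Join literal text and simple variables; missing variables become ''."""
    values = values or {}
    return "".join(
        node if isinstance(node, str) else str(values.get(node["name"], ""))
        for node in ast
    )


# Hand-written AST for "Hello, {name}!"
ast = ["Hello, ", {"name": "name"}, "!"]
print(interpolate_simple(ast, {"name": "World"}))  # Hello, World!
print(interpolate_simple(ast))  # Hello, !
```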

### `IntlMessageFormat.pattern`

**Type:** `str` — The original pattern string (read-only property).

### `IntlMessageFormat.locale`

**Type:** `babel.Locale` — The resolved Babel locale (read-only property).

## Supported ICU Features

### Simple variables
```python
IntlMessageFormat("Hello, {name}!", "en").format({"name": "World"})
# "Hello, World!"
```

### Plural
Selects a branch based on CLDR plural rules for the locale. Supports `one`, `two`, `few`, `many`, `other`, and `zero` categories, plus exact matches with `=N`.

```python
# English: one/other
IntlMessageFormat("{n, plural, one {# dog} other {# dogs}}", "en").format({"n": 1})
# "1 dog"

# Arabic: zero/one/two/few/many/other
IntlMessageFormat(
"{n, plural, zero {صفر} one {واحد} two {اثنان} few {# قليل} many {# كثير} other {# آخر}}", "ar"
).format({"n": 3})
# "3 قليل"

# Russian: one/few/many/other
IntlMessageFormat(
"{n, plural, one {# книга} few {# книги} many {# книг} other {# книг}}", "ru"
).format({"n": 21})
# "21 книга"
```

### Exact match
```python
IntlMessageFormat(
"{n, plural, =0 {no items} =1 {one item} other {# items}}", "en"
).format({"n": 0})
# "no items"
```

### Plural with offset
The `offset` value is subtracted before plural rule evaluation. The `#` hash displays the offset-adjusted value.

```python
IntlMessageFormat(
"{guests, plural, offset:1 =0 {nobody} =1 {{host}} one {{host} and # other} other {{host} and # others}}", "en"
).format({"guests": 3, "host": "Alice"})
# "Alice and 2 others"
```
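
The selection order can be sketched in a few lines. This follows standard ICU semantics, where exact `=N` keys are compared against the raw value and the plural category is computed from the offset-adjusted value; that the package behaves identically in every edge case is an assumption. `english_cardinal` and `select_plural_branch` are hypothetical names, and the rule shown covers integers in English only.

```python
def english_cardinal(n):
    """CLDR cardinal category for English integers: 'one' for 1, else 'other'."""
    return "one" if n == 1 else "other"


def select_plural_branch(value, options, offset=0, category=english_cardinal):
    """Return (branch key, value to display for '#')."""
    adjusted = value - offset
    exact = f"={value}"
    if exact in options:  # exact matches use the raw value and win first
        return exact, adjusted
    key = category(adjusted)  # category comes from the adjusted value
    return (key if key in options else "other"), adjusted


# Branch bodies abbreviated to plain strings for illustration
options = {"=0": "nobody", "=1": "host only", "one": "host and # other", "other": "host and # others"}
print(select_plural_branch(3, options, offset=1))  # ('other', 2)
print(select_plural_branch(2, options, offset=1))  # ('one', 1)
print(select_plural_branch(1, options, offset=1))  # ('=1', 0)
```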

### Selectordinal
Selects a branch based on CLDR ordinal plural rules.

```python
IntlMessageFormat(
"{n, selectordinal, one {#st} two {#nd} few {#rd} other {#th}}", "en"
).format({"n": 23})
# "23rd"
```
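
For reference, the CLDR ordinal rule for English is small enough to write out by hand. The sketch below is a stand-alone illustration of that rule for non-negative integers, not the package's implementation (which relies on Babel's CLDR data); `english_ordinal_category` is a hypothetical name.

```python
def english_ordinal_category(n):
    """CLDR ordinal category for English, for non-negative integers."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"  # 1st, 21st, 101st
    if n % 10 == 2 and n % 100 != 12:
        return "two"  # 2nd, 22nd
    if n % 10 == 3 and n % 100 != 13:
        return "few"  # 3rd, 23rd
    return "other"  # 4th, 11th, 12th, 13th


SUFFIXES = {"one": "st", "two": "nd", "few": "rd", "other": "th"}
labels = [f"{n}{SUFFIXES[english_ordinal_category(n)]}" for n in (1, 11, 22, 23, 112)]
print(labels)  # ['1st', '11th', '22nd', '23rd', '112th']
```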

### Select
Matches a string value against the branch keys and falls back to `other` when no key matches.

```python
IntlMessageFormat(
"{type, select, cat {meow} dog {woof} other {???}}", "en"
).format({"type": "cat"})
# "meow"
```
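
Conceptually, select is a dictionary lookup with a default. The sketch below shows that lookup over a parsed `options` dict; `select_branch` is a hypothetical helper, and converting the value with `str()` before matching is an assumption about how non-string values are handled.

```python
def select_branch(value, options):
    """Pick the branch whose key equals the value, else the 'other' branch."""
    return options.get(str(value), options["other"])


options = {"cat": ["meow"], "dog": ["woof"], "other": ["???"]}
print(select_branch("cat", options))   # ['meow']
print(select_branch("bird", options))  # ['???']
```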

### Nested expressions
Plural inside select, select inside plural, variables inside branches — all work.

```python
IntlMessageFormat(
"{gender, select, male {He has {n, plural, one {# item} other {# items}}} other {They have {n, plural, one {# item} other {# items}}}}", "en"
).format({"gender": "male", "n": 1})
# "He has 1 item"
```

### Hash `#` replacement
Inside plural/selectordinal branches, `#` is replaced with the numeric value (locale-formatted with grouping separators).

```python
IntlMessageFormat("{n, plural, other {# items}}", "en").format({"n": 1000})
# "1,000 items"

IntlMessageFormat("{n, plural, other {# Artikel}}", "de").format({"n": 1000})
# "1.000 Artikel"
```
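
The grouping behavior can be imitated with Python's built-in format specification. This is only an illustration of what the outputs above look like: the separator is passed in by hand here, whereas a real formatter would take it from locale data. `format_grouped` is a hypothetical name.

```python
def format_grouped(n, group_sep=","):
    """Format an integer with a grouping separator every three digits."""
    # "," in the format spec always groups with commas; swap in the locale's separator
    return f"{n:,}".replace(",", group_sep)


print(format_grouped(1000))       # 1,000
print(format_grouped(1000, "."))  # 1.000
```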

## Locale Support

Uses Babel's CLDR data for plural rules, covering 100+ locales. Tested against `icu4py` (ICU4C bindings) for correctness across:

- **English** (en) — one/other
- **French** (fr) — one/other (0 is "one")
- **German** (de) — one/other
- **Arabic** (ar) — zero/one/two/few/many/other
- **Russian** (ru) — one/few/many/other
- **Polish** (pl) — one/few/many/other
- **Japanese** (ja) — other only
- And all other locales supported by Babel/CLDR

## Known Carve-outs

- **Boolean values**: Python `True`/`False` are formatted as `"True"`/`"False"` (Python convention). ICU4C formats them as `1`/`0`.
- **Escape sequences**: `''` and `'{...}'` are unescaped during parsing (matching `@formatjs/icu-messageformat-parser` behavior). The formatted output contains the unescaped text.
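
If ICU4C-style `1`/`0` output is needed for booleans, one option is to convert them before calling `format()`. The helper below is a caller-side sketch, not something the package provides; `normalize_booleans` is a hypothetical name.

```python
def normalize_booleans(values):
    """Return a copy of values with True/False replaced by 1/0."""
    # bool is checked explicitly because it is a subclass of int in Python
    return {k: int(v) if isinstance(v, bool) else v for k, v in values.items()}


print(normalize_booleans({"active": True, "count": 3}))  # {'active': 1, 'count': 3}
```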
21 changes: 21 additions & 0 deletions packages/generaltranslation-supported-locales/README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,24 @@
<p align="center">
<a href="https://generaltranslation.com/docs/python">
<picture>
<source media="(prefers-color-scheme: light)" srcset="https://generaltranslation.com/brand/gt-logo-light.svg">
<img alt="General Translation" src="https://generaltranslation.com/brand/gt-logo-dark.svg" width="100" height="100">
</picture>
</a>
</p>

<p align="center">
<a href="https://generaltranslation.com/docs/python"><strong>Documentation</strong></a> · <a href="https://github.com/generaltranslation/gt-python/issues">Report Bug</a>
</p>

# generaltranslation-supported-locales

> ⚠️ **Experimental / Unstable** — This package is under active development and may be subject to breaking changes.

Locale validation and metadata for General Translation's Python packages.

## Installation

```bash
pip install generaltranslation-supported-locales
```