Skip to content

Conversation

@seizethedave
Copy link
Contributor

@seizethedave seizethedave commented Jun 22, 2024

This PR adds digit separators (1_000) to Jsonnet's numeric constants.

Accompanying issue with format proposal: #1155

@sbarzowski
Copy link
Contributor

I'm in favor of this. We need to also:

  1. Update the docs. At least . Ideallly also the tutorial.

See https://github.com/google/jsonnet?tab=readme-ov-file#locally-serving-the-website for working on documentation.

  1. Add some end-to-end examples. Just dump some .jsonnet files in test_suite and run https://github.com/google/jsonnet/blob/master/test_suite/refresh_golden.sh.

  2. We'll need to update go-jsonnet implementation (separate PR). It should be straightforward.

@seizethedave
Copy link
Contributor Author

Thanks @sbarzowski, I'll get cracking on those.

@seizethedave seizethedave marked this pull request as ready for review July 4, 2024 21:00
@seizethedave
Copy link
Contributor Author

@sbarzowski I think I'm ready for another round of feedback on this one. I suspect these might need a little more help:

  • a more formal treatment of the numeric grammar in the docs
  • note how std.parseInt and related functions will not honor numbers with underscores
    Thanks!

@seizethedave
Copy link
Contributor Author

Anything I can do to push this forward @sbarzowski? Thanks!

@johnbartholomew johnbartholomew self-assigned this Feb 23, 2025
@johnbartholomew
Copy link
Collaborator

Sorry for all the delays on this. I do hope to review this soon; it seems like a useful improvement.

@seizethedave
Copy link
Contributor Author

@johnbartholomew I'll try to find a few minutes to get this PR back into a healthy/mergable state. If you have any pre-review feedback or changes I can get those going.

core/lexer.cpp Outdated
case 'e':
case 'E': state = AFTER_E; break;

case '_': state = AFTER_UNDERSCORE; goto skip_char;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Underscore must not be allowed after a leading zero.

JSON numbers that start with a 0 may only be 0 itself, or a fraction starting 0., or a (fairly pointless...) 0 with an exponent (e.g., 0e5). If you go to the AFTER_UNDERSCORE state after a leading zero then it can be used to produce an invalid number, e.g., 0_5 is read as 05 which is not allowed.

(The reason it's not allowed is because it's ambiguous - Javascript reads numbers with a leading zero as octal numbers, but JSON does not support octal numbers, so to avoid a misleading difference between JSON and Javascript leading zeros are simply forbidden)

core/lexer.cpp Outdated
case '6':
case '7':
case '8':
case '9': state = AFTER_ONE_TO_NINE; break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this is not correct. We need to maintain the distinction between state AFTER_ONE_TO_NINE (which is the state used in the integer part of the number) and state AFTER_DIGIT which is used after the decimal point. Having a single AFTER_UNDERSCORE state which transitions to AFTER_ONE_TO_NINE means that you allow an invalid repetition of the decimal point, e.g., this accepts 12.34_56.78_90.12 which should not be allowed.

I see you already handled this correctly for the exponent part, where you have a separate AFTER_EXP_UNDERSCORE state; you just need a similar AFTER_FRAC_UNDERSCORE state to correspond with AFTER_DIGIT.

(For what it's worth, the existing state names AFTER_ONE_TO_NINE and AFTER_DIGIT are really confusing!)

@johnbartholomew
Copy link
Collaborator

Well, "reviewing this soon" turned into ignoring it for an entire year 😭

This is basically very good and I should have merged it a long time ago! There are a couple of errors in the state machine, I believe, which I have commented on. But on a positive note, there have been almost no other changes to lexer.cpp in the meantime, so this change still rebases cleanly.

On the basis that this has been sitting in the queue for far, far too long already, I will go ahead and rebase this myself, add a couple of fixes, and merge it.

…tion

There are some cases which are a little strange but lexically valid.

- `1.2.3.4` lexically this tokenises as `1.2` DOT `3.4`, because a dot
  in the fractional or exponent part of a number is simply treated the
  same as any other possible terminating character (any character that
  isn't part of the valid number lexical syntax)
- `1e2.34` lexically is `1e2` DOT `34` (same as the first case)
- `1e2e34` lexically is `1e2` (number) `e34` (identifier)

These behaviours are basically preserved/extrapolated in the case of
digit separators, so for example `1_2.3_4.5_6` is lexically parsed
as `12.34` DOT `56`. And `1e2_3e4` is lexically parsed as
`1e23` (number), `e4` (identifier). These both look very confusing,
but it probably doesn't matter because those token sequences are,
I think, not valid syntactically so they'll just be rejected by
the parser.

Note that in JSON (and jsonnet), leading zeros are not allowed in
numeric literals. This behaviour is explicitly kept with digit
separators, so `0_5` is explicitly rejected. The alternatives are:

- Treat underscore after an initial zero the same as any terminator
  character, so `0_5` lexes as tokens `0` followed by identifier `_5`.
- Allow underscore, thereby breaking the no-leading-zeros rule, so
  `0_5` tokenises as `05`.

Either option seems confusing, hence it seems better to explicitly
reject an underscore after an initial zero.
@johnbartholomew
Copy link
Collaborator

Oh, @seizethedave I see you actually have seen the review comments already!

Would you prefer to do the updates yourself, or would you prefer that I add fixes? I'm happy to do either.

@seizethedave
Copy link
Contributor Author

Well, "reviewing this soon" turned into ignoring it for an entire year 😭

Hey, I'm familiar with that operating style.

@johnbartholomew I will take you up on the offer to rebase/fix/submit, thank you very much! Good catches on the state machine.

@johnbartholomew johnbartholomew merged commit f78b553 into google:master Jan 21, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants