-
Notifications
You must be signed in to change notification settings - Fork 470
Add digit separators #1160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add digit separators #1160
Conversation
|
I'm in favor of this. We need to also:
See https://github.com/google/jsonnet?tab=readme-ov-file#locally-serving-the-website for working on documentation.
|
|
Thanks @sbarzowski, I'll get cracking on those. |
|
@sbarzowski I think I'm ready for another round of feedback on this one. I suspect these might need a little more help:
|
|
Anything I can do to push this forward @sbarzowski? Thanks! |
|
Sorry for all the delays on this. I do hope to review this soon; it seems like a useful improvement. |
|
@johnbartholomew I'll try to find a few minutes to get this PR back into a healthy/mergable state. If you have any pre-review feedback or changes I can get those going. |
core/lexer.cpp
Outdated
| case 'e': | ||
| case 'E': state = AFTER_E; break; | ||
|
|
||
| case '_': state = AFTER_UNDERSCORE; goto skip_char; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Underscore must not be allowed after a leading zero.
JSON numbers that start with a 0 may only be 0 itself, or a fraction starting 0., or a (fairly pointless...) 0 with an exponent (e.g., 0e5). If you go to the AFTER_UNDERSCORE state after a leading zero then it can be used to produce an invalid number, e.g., 0_5 is read as 05 which is not allowed.
(The reason it's not allowed is because it's ambiguous - Javascript reads numbers with a leading zero as octal numbers, but JSON does not support octal numbers, so to avoid a misleading difference between JSON and Javascript leading zeros are simply forbidden)
core/lexer.cpp
Outdated
| case '6': | ||
| case '7': | ||
| case '8': | ||
| case '9': state = AFTER_ONE_TO_NINE; break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this is not correct. We need to maintain the distinction between state AFTER_ONE_TO_NINE (which is the state used in the integer part of the number) and state AFTER_DIGIT which is used after the decimal point. Having a single AFTER_UNDERSCORE state which transitions to AFTER_ONE_TO_NINE means that you allow an invalid repetition of the decimal point, e.g., this accepts 12.34_56.78_90.12 which should not be allowed.
I see you already handled this correctly for the exponent part, where you have a separate AFTER_EXP_UNDERSCORE state; you just need a similar AFTER_FRAC_UNDERSCORE state to correspond with AFTER_DIGIT.
(For what it's worth, the existing state names AFTER_ONE_TO_NINE and AFTER_DIGIT are really confusing!)
|
Well, "reviewing this soon" turned into ignoring it for an entire year 😭 This is basically very good and I should have merged it a long time ago! There are a couple of errors in the state machine, I believe, which I have commented on. But on a positive note, there have been almost no other changes to lexer.cpp in the meantime, so this change still rebases cleanly. On the basis that this has been sitting in the queue for far, far too long already, I will go ahead and rebase this myself, add a couple of fixes, and merge it. |
…tion There are some cases which are a little strange but lexically valid. - `1.2.3.4` lexically this tokenises as `1.2` DOT `3.4`, because a dot in the fractional or exponent part of a number is simply treated the same as any other possible terminating character (any character that isn't part of the valid number lexical syntax) - `1e2.34` lexically is `1e2` DOT `34` (same as the first case) - `1e2e34` lexically is `1e2` (number) `e34` (identifier) These behaviours are basically preserved/extrapolated in the case of digit separators, so for example `1_2.3_4.5_6` is lexically parsed as `12.34` DOT `56`. And `1e2_3e4` is lexically parsed as `1e23` (number), `e4` (identifier). These both look very confusing, but it probably doesn't matter because those token sequences are, I think, not valid syntactically so they'll just be rejected by the parser. Note that in JSON (and jsonnet), leading zeros are not allowed in numeric literals. This behaviour is explicitly kept with digit separators, so `0_5` is explicitly rejected. The alternatives are: - Treat underscore after an initial zero the same as any terminator character, so `0_5` lexes as tokens `0` followed by identifier `_5`. - Allow underscore, thereby breaking the no-leading-zeros rule, so `0_5` tokenises as `05`. Either option seems confusing, hence it seems better to explicitly reject an underscore after an initial zero.
|
Oh, @seizethedave I see you actually have seen the review comments already! Would you prefer to do the updates yourself, or would you prefer that I add fixes? I'm happy to do either. |
Hey, I'm familiar with that operating style. @johnbartholomew I will take you up on the offer to rebase/fix/submit, thank you very much! Good catches on the state machine. |
b7aa8bd to
82ebe7d
Compare
This PR adds digit separators (
1_000) to Jsonnet's numeric constants.Accompanying issue with format proposal: #1155