
Conversation

@leakonvalinka (Member) commented Nov 10, 2025

This PR

  • adds the fatalStatusCodes option + env variable

Related Issues

resulted from this issue

Notes

I'm not too happy with how the fatal error is communicated through the different components (received at SyncStreamQueueSource -> FlagStore -> InProcessResolver -> FlagdProvider, or RpcResolver -> FlagdProvider, respectively). It "misuses" the STALE state to differentiate between normal errors and fatal errors. Unfortunately, I couldn't find a cleaner solution for this, so feedback on this would be highly appreciated!

Will work on the remaining failing tests once we agree on how to proceed!

Follow-up Tasks

How to test
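
A minimal sketch of how the new option might be exercised; the builder method name and the chosen status codes are assumptions based on the PR title, not the final API:

FlagdOptions options = FlagdOptions.builder()
        .fatalStatusCodes(List.of("UNAUTHENTICATED", "PERMISSION_DENIED"))
        .build();
OpenFeatureAPI.getInstance().setProviderAndWait(new FlagdProvider(options));
// A stream error with one of these codes should stop the retries and
// transition the provider to FATAL instead of reconnecting indefinitely.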

@leakonvalinka leakonvalinka force-pushed the fix/flagd-infinite-connection-retries branch from f7f1d97 to f0a1db2 Compare November 20, 2025 12:22
@leakonvalinka leakonvalinka changed the title fix(flagd): no retry for certain error codes, implement test steps feat(flagd): introduce fatalStatusCodes option Dec 17, 2025
Signed-off-by: Konvalinka <[email protected]>
@leakonvalinka leakonvalinka marked this pull request as ready for review December 17, 2025 10:34
@leakonvalinka leakonvalinka requested a review from a team as a code owner December 17, 2025 10:34
@chrfwow (Contributor) left a comment

Since we do not want to introduce breaking changes into the API by adding a PROVIDER_FATAL type to ProviderEvent, I have two suggestions for how we might work around the "misuse" of the STALE event:

  • We could add an isFatal flag to the FlagdProviderEvent to track the type of error (a rough sketch follows below). I don't really like it, because the flag could also be set when the event is not an error event, and it splits information that should live in one place across two places.
  • Or we create an enum class ExtendedProviderEvent, which is a copy of ProviderEvent (enums cannot be extended in Java) plus the additional PROVIDER_FATAL field. We would then have to map between the two types where needed (not 100% sure this would work). I don't like this either, because it duplicates the ProviderEvent enum.
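
A rough sketch of the first option, with a hypothetical class shape and field names (not the actual FlagdProviderEvent):

class FlagdProviderEvent {
    private final ProviderEvent event;
    private final boolean fatal; // only meaningful when event == PROVIDER_ERROR

    FlagdProviderEvent(ProviderEvent event, boolean fatal) {
        this.event = event;
        this.fatal = fatal;
    }

    ProviderEvent getEvent() {
        return event;
    }

    boolean isFatal() {
        return fatal;
    }
}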

private final BlockingQueue<QueuePayload> outgoingQueue = new LinkedBlockingQueue<>(QUEUE_SIZE);
private final FlagSyncServiceStub flagSyncStub;
private final FlagSyncServiceBlockingStub metadataStub;
private final List<String> fatalStatusCodes;
Contributor

Since we do lots of .contains operations on this data structure, a HashSet might be more performant. How many entries do we expect in this list?
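
A minimal sketch of the Set-based variant; the accessor name on the options object is an assumption:

private final Set<String> fatalStatusCodes; // java.util.Set for O(1) contains()

// in the constructor:
this.fatalStatusCodes = Set.copyOf(options.getFatalStatusCodes());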

@leakonvalinka (Member Author) Dec 17, 2025

That's hard for me to estimate; what do the others think? The currently defined default is an empty list.

.map(String::trim)
.collect(Collectors.toList()) : defaultValue;
} catch (Exception e) {
return defaultValue;
Contributor

We should log an info/warn when the env vars are invalid.
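
A minimal sketch of such a warning, assuming an SLF4J logger named log and hypothetical rawValue/envVarName variables:

} catch (Exception e) {
    // surface the bad value instead of silently falling back to the default
    log.warn("Invalid value '{}' for environment variable {}, falling back to default", rawValue, envVarName, e);
    return defaultValue;
}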

Member Author

Just for this method? Or the other ones too? I'd either leave it or add it in all cases to be consistent

Contributor

Then we should add it everywhere, but in a different PR

Member Author

Alright, sounds good. Should we create a new issue for this or is that overkill?

@guidobrei (Member) commented Dec 17, 2025

I'm not too happy with how the fatal error is communicated through the different components (received at SyncStreamQueueSource -> FlagStore -> InProcessResolver -> FlagdProvider...)

This is an implication of our provider design, and there is not really anything we can do about that (in this PR).

}
break;
case ERROR:
if (!stateBlockingQueue.offer(new StorageStateChange(StorageState.STALE))) {
Member

Could we simply add a FATAL storage state to resolve this conceptual "STALE" overloading? This is an entirely private enum, so we can add to it without issue.

@leakonvalinka (Member Author) Dec 19, 2025

Yes, we could, but this would only solve the misuse issue in the communication step from FlagStore -> InProcessResolver (and not InProcessResolver -> FlagdProvider).
Also, nitpick: StorageState.ERROR is already documented as /** Storage is in an unrecoverable error stage. */, which models what FATAL means for us, I think.

Member

Based on the existing StorageState docs, the states are now used as intended:

  • STALE: "Storage has gone stale (most recent sync failed). May get to OK status with next sync." In other words, the QueueSource encountered an error but will try to recover.
  • ERROR: "Storage is in an unrecoverable error stage." In other words, the QueueSource encountered an error AND will not try to recover; it exited the sync loop.
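
A minimal sketch of that mapping, with a hypothetical willRetry flag standing in for whatever the sync loop actually tracks:

StorageState next = willRetry ? StorageState.STALE : StorageState.ERROR;
if (!stateBlockingQueue.offer(new StorageStateChange(next))) {
    log.warn("Failed to convey storage state {}", next);
}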

Member

We fixed this by propagating the provider details as well through the consumer, so we don't need to overload or abuse the enums anymore.


private final AtomicBoolean shutdown = new AtomicBoolean(false);
private final AtomicBoolean shouldThrottle = new AtomicBoolean(false);
private final AtomicBoolean successfulSync = new AtomicBoolean(false);
Member

My biggest question with this whole concept (not your implementation) is whether or not we should care about whether this is the initial sync or not. I'm actually leaning towards "not"... here is my reasoning (anyone feel free to disagree):

  • users already fully control what's considered FATAL; they can also control whether to treat FATAL at init differently from FATAL later, using event handlers and the details of the exception (see the sketch after this comment)
  • it's simpler (fewer conditions and less state to handle in our code [this field would disappear] and for users to understand)

WDYT?
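
A minimal sketch of what that could look like on the user side, assuming the SDK's onProviderError handler and an error code on the event details:

Client client = OpenFeatureAPI.getInstance().getClient();
client.onProviderError(details -> {
    if (ErrorCode.PROVIDER_FATAL.equals(details.getErrorCode())) {
        // application-specific reaction, e.g. abort startup or fall back to defaults
    }
});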

Member Author

If it's easy to get the same behaviour through event handlers, I think that might be better, because it allows for more customization. I get both sides: one might not want to completely shut down if a valid flag config was previously received, but one also might not want to keep working with stale data after a non-transient error was received.

Member

I agree with @toddbaert that fatality does not depend on being the first sync or not.

However, fatality from the SyncStreamQueueSource's perspective depends on whether we break out of the sync loop and stop the sync thread (FATAL ERROR), or continue to reconnect (ERROR).

@toddbaert (Member) commented Dec 18, 2025

Since we do not want to introduce breaking changes into the api by adding a PROVIDER_FATAL type to ProviderEvent, I have two suggestions how we might be able to work around the "misuse" of the stale event: We could add a isFatal flag to the FlagdProviderEvent to track the type of error. I don't really like it because this could also be set when the event is not an error event, and with this we split up information that should be stored in one place into two places. Or, we create an enum class ExtendedProviderEvent, which is a copy of ProviderEvent (enums cannot be extended in Java), plus the additional PROVIDER_FATAL field. We would then have to map where needed between the two types (not 100% sure if this will work). I don't like this either, because we would duplicate the ProviderEvent enum

I could be missing something, but I don't think this is an issue. The "fatalness" (fatality?) of an event is not communicated by the event type, but by the error code associated with the event: https://github.com/open-feature/java-sdk/blob/main/src/main/java/dev/openfeature/sdk/ProviderEventDetails.java#L16. All error events are events. Some error events carry FATAL error codes.

This is basically what's done in Go as well, right @alexandraoberaigner?

@leakonvalinka leakonvalinka force-pushed the fix/flagd-infinite-connection-retries branch from 341d1e1 to 94c7691 Compare December 19, 2025 11:02
Signed-off-by: Konvalinka <[email protected]>
@leakonvalinka (Member Author)

I could be missing something, but I don't think this is an issue. The "fatalness" (fatality?) of an event is not communicated by the event type, but by the error code associated with the event: https://github.com/open-feature/java-sdk/blob/main/src/main/java/dev/openfeature/sdk/ProviderEventDetails.java#L16. All error events are events. Some error events carry FATAL error codes.

The issue is that the onProviderEvent method expects a FlagdProviderEvent, which doesn't allow differentiating between error types (it is based on the ProviderEvent enum), but we need to tell "normal" errors apart from fatal ones. My implementation does this via the STALE misuse, which isn't a super clean approach. @chrfwow's suggestions would be two alternatives, but they're also not 100% clean.

Personally, I think I prefer adding a FATAL value to the StorageState enum (despite my nitpick above) and adding an isFatal flag to the FlagdProviderEvent, but as this also has its disadvantages, I'd like to hear other opinions as well! :)

Signed-off-by: Konvalinka <[email protected]>
@alexandraoberaigner (Contributor) commented Dec 22, 2025

I could be missing something, but I don't think this is an issue. The "fatalness" (fatality?) of an event is not communicated by the event type, but by the error code associated with the event: https://github.com/open-feature/java-sdk/blob/main/src/main/java/dev/openfeature/sdk/ProviderEventDetails.java#L16. All error events are events. Some error events carry FATAL error codes.
This is basically what's done in Go as well, right @alexandraoberaigner?

Yes, in Go it's a provider event with an error code: see ProviderInitError with ErrorCode including PROVIDER_FATAL

Just found a test in the java-sdk that shows how the fatal error should look: see here

ProviderEvent.PROVIDER_ERROR,
ProviderEventDetails.builder()
                        .errorCode(ErrorCode.PROVIDER_FATAL)
                        .build()

@leakonvalinka (Member Author) commented Dec 22, 2025

Just found a test in the java-sdk that shows how the fatal error should look: see here

ProviderEvent.PROVIDER_ERROR,
ProviderEventDetails.builder()
                       .errorCode(ErrorCode.PROVIDER_FATAL)
                       .build()

Yup, I also implemented it like this when the FlagdProvider emits the fatal error here.
But that doesn't change anything regarding the question of how to cleanly propagate the error through the layers until it reaches the FlagdProvider class, unless I am missing something.

@toddbaert (Member) left a comment

I made a few small test fixes.

Looks good to me, and FATAL works as expected. One thing to note is that as discussed in various places, we DO NOT act differently on FATAL status codes depending on whether this is the initial connection or not; if the stream sends a FATAL code at any time, the provider transitions to FATAL.

See: open-feature/flagd#1818 (comment)

@toddbaert toddbaert force-pushed the fix/flagd-infinite-connection-retries branch from 60a282f to 9146dc5 Compare December 22, 2025 19:45
Signed-off-by: Todd Baert <[email protected]>
@toddbaert toddbaert force-pushed the fix/flagd-infinite-connection-retries branch from 9146dc5 to a969488 Compare December 22, 2025 20:06
@Then("the client should be in {} state")
public void the_client_should_be_in_fatal_state(String clientState) {
assertThat(state.client.getProviderState()).isEqualTo(ProviderState.valueOf(clientState.toUpperCase()));
await().pollDelay(100, TimeUnit.MILLISECONDS)
@toddbaert (Member) Dec 22, 2025

I had to make this test tolerate a small retry/timeout here (it worked most of the time but was flaky for both in-process and RPC)

I think there's a small lag time between when we fire events and when the SDK updates the client status. We should consider whether this is a small bug in the SDK, and whether we should guarantee that the state is updated before the event handlers run (or not).

@guidobrei @aepfli @leakonvalinka
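
For reference, a minimal sketch of how such a lag can be tolerated with Awaitility; the timeout value is an arbitrary assumption:

await().atMost(2, TimeUnit.SECONDS)
        .untilAsserted(() -> assertThat(state.client.getProviderState())
                .isEqualTo(ProviderState.valueOf(clientState.toUpperCase())));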

@toddbaert toddbaert requested review from aepfli and chrfwow December 22, 2025 20:17
