tensorstore::Open with create | delete_existing intermittently fails ALREADY_EXISTS on GCS #290

@cnsgsz

Setup

  • driver: zarr (kvstore: gcs)
  • OpenMode: create | delete_existing
  • A single client process opens the store once for write at job startup. No
    concurrent writers on the same path.

Symptom

tensorstore::Open intermittently returns ALREADY_EXISTS on the .zarray
write, even though delete_existing should remove any prior state first. It
affects a small fraction of paths in long batch runs (~6 out of several
thousand). Retrying the same path a few seconds later succeeds, and the
same code path succeeds on all other opens.

ALREADY_EXISTS: Error opening "zarr" driver: Error writing
gs://.../foo.zarr/.zarray
[source locations='tensorstore/internal/cache/kvs_backed_cache.h:220
tensorstore/driver/driver.cc:115']

Likely cause

A timing window inside delete_existing's implementation: the delete of
.zarray is acknowledged, but the create's existence check sees stale
metadata and fails. Plausibly a metadata-cache / consistency window in the
GCS kvstore layer.

Suggestion

Either (a) make delete_existing internally retry on ALREADY_EXISTS for a
short window, or (b) document that callers should retry. Right now the
failure looks like a code bug rather than a transient remote-storage
hiccup, because create | delete_existing should be atomic from the
caller's perspective.
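
For reference, the caller-side workaround in option (b) could look roughly like the sketch below. It is not TensorStore code. OpenStatus and OpenWithRetry are hypothetical names made up for illustration, and the callable stands in for whatever wraps the real tensorstore::Open call and maps its resulting status to one of the three cases. The attempt count and backoff delays are arbitrary assumptions as well.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the outcome of one open attempt. In real code
// this would be derived from the status returned by the open call.
enum class OpenStatus { kOk, kAlreadyExists, kOtherError };

// Calls `open_once` until it returns something other than kAlreadyExists or
// until `max_attempts` calls have been made, sleeping between attempts with
// exponential backoff. Returns the status of the last attempt. If
// `max_attempts` is not positive, `open_once` is never called and
// kOtherError is returned.
inline OpenStatus OpenWithRetry(const std::function<OpenStatus()>& open_once,
                                int max_attempts,
                                std::chrono::milliseconds initial_delay) {
  OpenStatus status = OpenStatus::kOtherError;
  std::chrono::milliseconds delay = initial_delay;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    status = open_once();
    // Only ALREADY_EXISTS is treated as transient. Success and every other
    // error are returned to the caller immediately.
    if (status != OpenStatus::kAlreadyExists) return status;
    if (attempt < max_attempts) {
      std::this_thread::sleep_for(delay);
      delay *= 2;  // Double the wait before the next attempt.
    }
  }
  return status;
}
```

A caller would pass a lambda that performs the open with create | delete_existing and reports kAlreadyExists only for that specific error, e.g. OpenWithRetry(open_fn, 5, std::chrono::milliseconds(500)). Retrying only on ALREADY_EXISTS keeps genuine failures (permissions, bad spec) from being masked.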
