fix: accept GeoJSON strings for Edm.GeographyPoint in AzureSearchWriter#2556
chon3806 wants to merge 3 commits into microsoft:master from
Conversation
…er (microsoft#2420) Azure AI Search expects spatial values as GeoJSON objects, but when users supplied a StringType column the writer JSON-escaped the entire string and the service rejected the request with HTTP 400. Convert string GeographyPoint columns into the canonical struct<type, coordinates> shape via from_json before serialization, mirroring the existing Edm.DateTimeOffset handling. Existing struct-based input is unchanged.
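For illustration, the two request shapes differ as follows (the `id` field, document values, and coordinates are assumed for the example, not taken from the PR; the `value`/`@search.action` envelope is the Azure AI Search documents API format). Before the fix, the string column was serialized as an escaped JSON string value:

```json
{"value": [{"@search.action": "upload", "id": "1",
  "location": "{\"type\":\"Point\",\"coordinates\":[-122.13,47.67]}"}]}
```

After the fix, the parsed struct serializes as the GeoJSON object the service expects:

```json
{"value": [{"@search.action": "upload", "id": "1",
  "location": {"type": "Point", "coordinates": [-122.13, 47.67]}}]}
```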
Hey @chon3806 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guide on building from source.
Pull request overview
This PR fixes Azure AI Search ingestion failures when AzureSearchWriter is given Edm.GeographyPoint values in StringType columns containing GeoJSON by parsing those strings into the expected struct shape before JSON serialization.
Changes:
- Added a private normalization step to parse `StringType` GeoJSON into `struct<type:string, coordinates:array<double>>` for `Edm.GeographyPoint` index fields.
- Wired the new normalization into `AzureSearchWriter.prepareDF` before schema parity checks and request serialization.
- Added an end-to-end Scala test covering GeographyPoint values supplied as strings.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cognitive/src/main/scala/com/microsoft/azure/synapse/ml/services/search/AzureSearch.scala | Adds GeoJSON string → struct conversion for Edm.GeographyPoint fields and applies it during DF preparation. |
| cognitive/src/test/scala/com/microsoft/azure/synapse/ml/services/search/split2/SearchWriterSuitePart2.scala | Adds an end-to-end test that writes docs with GeoJSON GeographyPoint stored as strings. |
```scala
currentDF.schema(fieldName).dataType match {
  case StringType =>
    currentDF.withColumn(fieldName,
      when(col(fieldName).isNotNull, from_json(col(fieldName), geoStructType))
```
from_json defaults to permissive parsing, so malformed/invalid GeoJSON strings will be converted to null silently. That can lead to documents being ingested with a missing location (or unexpected null updates) instead of surfacing an error. Consider using from_json with mode=FAILFAST (or otherwise explicitly validating the parse result) so invalid GeoJSON fails loudly rather than being dropped.
Suggested change:

```scala
when(
  col(fieldName).isNotNull,
  from_json(col(fieldName), geoStructType, Map("mode" -> "FAILFAST"))
)
```
Done in 95a7ae1 -- switched to `from_json(..., Map("mode" -> "FAILFAST"))` and added a unit test (`convertGeographyPointToStruct fails fast on malformed GeoJSON instead of silently nulling`) that asserts a `SparkException` on materialization.
```scala
/**
 * For each field declared as `Edm.GeographyPoint` in the index, if the corresponding
 * DataFrame column is a `StringType`, parse it into the canonical
 * `struct<type:string, coordinates:array<double>>` so that downstream `to_json`
 * emits a proper GeoJSON object. Columns that are already structured are left as-is.
 *
 * @param df DataFrame with potential GeographyPoint columns
 * @param indexJson JSON string containing the index schema
 * @return DataFrame with string GeographyPoint columns converted to GeoJSON structs
 */
private def convertGeographyPointToStruct(df: DataFrame, indexJson: String): DataFrame = {
  val geoStructType = StructType(Seq(
    StructField("type", StringType),
    StructField("coordinates", ArrayType(DoubleType))
  ))
  val geoFields = parseIndexJson(indexJson).fields
    .filter(_.`type` == "Edm.GeographyPoint")
    .map(_.name)
  geoFields.foldLeft(df) { (currentDF, fieldName) =>
```
The Scaladoc says this runs "for each field declared as Edm.GeographyPoint in the index", but the implementation only inspects parseIndexJson(indexJson).fields (top-level fields) and won’t convert GeographyPoint values nested inside complex types. Either clarify the doc to state it’s top-level only, or extend the traversal to nested fields so behavior matches the comment.
Done in 95a7ae1 -- updated the Scaladoc to state "top-level" explicitly and noted that the scope mirrors `convertDateTimeToISO8601` (which has the same top-level-only limitation). Extending nested-field traversal felt out-of-scope for this bug fix; happy to do it as a follow-up if preferred.
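For the potential follow-up, the nested traversal could be sketched roughly like this. `IndexField` here is a hypothetical stand-in model (the real `parseIndexJson` field type may differ), so this is only a shape of the recursion, not the repo's API:

```scala
// Hypothetical sketch: collect dot-separated paths of every Edm.GeographyPoint
// field, including fields nested inside complex types.
case class IndexField(name: String,
                      `type`: String,
                      fields: Option[Seq[IndexField]] = None)

def geoPaths(fields: Seq[IndexField], prefix: String = ""): Seq[String] =
  fields.flatMap { f =>
    val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    val here = if (f.`type` == "Edm.GeographyPoint") Seq(path) else Seq.empty
    here ++ f.fields.map(geoPaths(_, path)).getOrElse(Seq.empty)
  }

val schema = Seq(
  IndexField("id", "Edm.String"),
  IndexField("location", "Edm.GeographyPoint"),
  IndexField("address", "Edm.ComplexType", Some(Seq(
    IndexField("geo", "Edm.GeographyPoint")))))

// geoPaths(schema) == Seq("location", "address.geo")
```

The converter would then rewrite each returned path (e.g. with nested `withField` updates) instead of only top-level column names.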
```scala
retryWithBackoff(assertSize(in, 2))
```
This test only asserts document count after the write. If GeoJSON parsing were to fail and produce null (or otherwise lose the spatial payload), Azure Search could still ingest the documents and this test would pass. To make the test validate the intended behavior, consider fetching the stored documents (or querying/selecting location) and asserting the location field is present and shaped as a GeoJSON object (e.g., has type and coordinates).
Addressed in 95a7ae1. Two complementary guarantees now make the count check meaningful:
- The new FAILFAST parsing mode raises a `SparkException` on malformed GeoJSON before the request is even built (covered by `convertGeographyPointToStruct fails fast on malformed GeoJSON ...`).
- `AzureSearchWriter.write` runs with `fatalErrors=true` by default, so any 400 from the service throws a `RuntimeException` and fails the test before `assertSize` is reached.

Combined with the new unit tests that assert the converted struct shape and parsed coordinates directly, a count of 2 is only achievable if the documents were accepted as valid spatial objects. The repo's existing search tests follow the same `assertSize`-only pattern and there's no helper for fetching individual documents, so I kept the e2e style consistent.
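The struct-shape assertion mentioned above can be sketched as a non-network check. A minimal sketch, assuming a local SparkSession; the column name `location` and the sample coordinates are illustrative:

```scala
// Parse a GeoJSON string into the canonical struct and assert its shape
// directly, without touching Azure Search.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").appName("geo-shape").getOrCreate()
import spark.implicits._

val geoStructType = StructType(Seq(
  StructField("type", StringType),
  StructField("coordinates", ArrayType(DoubleType))))

val parsed = Seq("""{"type":"Point","coordinates":[-122.13,47.67]}""")
  .toDF("location")
  .withColumn("location", from_json(col("location"), geoStructType))
  .head.getStruct(0)

// The parsed struct carries the GeoJSON type tag and the coordinate array.
println(parsed.getString(0))      // Point
println(parsed.getSeq[Double](1))
spark.stop()
```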
…or GeographyPoint conversion
- Replace unicode em-dash with ASCII to avoid encoding/scalastyle surprises.
- Replace 'microsoft#2420' in Scaladoc (member-reference syntax) with a full URL.
- Make convertGeographyPointToStruct private[ml] so it can be exercised by in-tree unit tests without requiring live Azure Search credentials.
- Add two non-network unit tests covering the string->struct rewrite, null preservation, and the no-op path for already-structured columns.
@microsoft-github-policy-service agree
- Use Spark's FAILFAST parsing mode in from_json so malformed GeoJSON surfaces an explicit exception instead of being silently coerced to null and shipped to Azure Search.
- Clarify Scaladoc to state the conversion is top-level-only (mirroring convertDateTimeToISO8601), and document the FAILFAST behavior.
- Add a unit test asserting malformed GeoJSON fails fast at materialization.
- Document why the end-to-end test's count assertion is sufficient: with fatalErrors=true (default) any service-side rejection throws, so a passing count proves the documents were accepted as valid spatial objects.
Hi @BrendanWalsh, friendly nudge on this one when you have a moment. The CLA is signed, the semantic title check passes, and the Copilot review's 3 comments have all been addressed in 95a7ae1. CI hasn't run yet (fork PR, so it needs a maintainer to approve workflows). Happy to address any further feedback. Thanks!
Summary
Fixes #2420.
`AzureSearchWriter` failed with `400 Bad Request` when a user provided an `Edm.GeographyPoint` value as a `StringType` column containing GeoJSON. The string was JSON-escaped during serialization, so Azure AI Search received a quoted string instead of a spatial object.

Fix

Added `convertGeographyPointToStruct` in `AzureSearch.scala`, mirroring the existing `convertDateTimeToISO8601` handling. For every top-level index field declared as `Edm.GeographyPoint`, if the corresponding DataFrame column is a `StringType`, it is parsed via `from_json` (with `mode = FAILFAST`) into the canonical `struct<type: string, coordinates: array<double>>` before `checkSchemaParity` and `to_json`. Struct-based inputs continue to work unchanged. `FAILFAST` was chosen so malformed GeoJSON surfaces as a loud `SparkException` at row materialization rather than silently nulling out the field (the default `PERMISSIVE` behavior of `from_json`). This is consistent with `AzureSearchWriter.write` defaulting to `fatalErrors = true`.

Tests
Added in `SearchWriterSuitePart2.scala`:
- `Handle GeoJSON GeographyPoint fields supplied as strings`: writes documents with a `StringType` location column and verifies the index ingests them (any `400` would throw via `fatalErrors = true`).
- `convertGeographyPointToStruct parses GeoJSON strings into structs`
- `convertGeographyPointToStruct leaves struct columns untouched`
- `convertGeographyPointToStruct fails fast on malformed GeoJSON instead of silently nulling`

Compatibility
The new helper is `private[ml]`; no public API changes.

Review feedback addressed
All 3 Copilot review comments have been addressed in 95a7ae1:
- Switched `from_json` to `FAILFAST` mode so malformed GeoJSON is not silently nulled.
- Clarified the Scaladoc to state the conversion is top-level-only (mirroring `convertDateTimeToISO8601`).
- Documented why the end-to-end count assertion is sufficient and added unit tests asserting the converted struct shape.