wip: check cost of hashing in multi column aggregate [IGNORE] #19346

rluvaton · 2025-12-15T22:09:33Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

rluvaton · 2025-12-15T22:10:20Z

run benchmark aggregate_query_sql

rluvaton · 2025-12-15T22:52:12Z

show benchmark queue

alamb-ghbot · 2025-12-15T22:52:17Z

🤖 Hi @rluvaton, you asked to view the benchmark queue (#19346 (comment)).

Job	User	Benchmarks	Comment
`19287_3657520190.sh`	alamb	external_aggr	`https://github.com/apache/datafusion/pull/19287#issuecomment-3657520190`
`19344_3657763461.sh`	alamb	default	`https://github.com/apache/datafusion/pull/19344#issuecomment-3657763461`
`19346_3657820674.sh`	rluvaton	aggregate_query_sql	`https://github.com/apache/datafusion/pull/19346#issuecomment-3657820674`

Dandandan · 2025-12-16T07:31:28Z

run benchmark aggregate_query_sql

Dandandan · 2025-12-16T07:31:43Z

show benchmark queue

alamb-ghbot · 2025-12-16T07:31:47Z

🤖 Hi @Dandandan, you asked to view the benchmark queue (#19346 (comment)).

Job	User	Benchmarks	Comment
`19287_3657520190.sh`	alamb	external_aggr	`https://github.com/apache/datafusion/pull/19287#issuecomment-3657520190`
`19344_3657763461.sh`	alamb	default	`https://github.com/apache/datafusion/pull/19344#issuecomment-3657763461`
`19346_3657820674.sh`	rluvaton	aggregate_query_sql	`https://github.com/apache/datafusion/pull/19346#issuecomment-3657820674`
`19346_3659197246.sh`	Dandandan	aggregate_query_sql	`https://github.com/apache/datafusion/pull/19346#issuecomment-3659197246`

rluvaton · 2025-12-16T11:26:39Z

@alamb any idea why it's not working?

alamb · 2025-12-16T12:30:34Z

> @alamb any idea why it's not working?

My script runner bails on error, and there was something wrong with one of the jobs. I need to make it more resilient to errors and give it better error reporting.

alamb-ghbot · 2025-12-16T12:42:40Z

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing check-cost-of-hashing (3aa62b0) to 58377bf diff
BENCH_NAME=aggregate_query_sql
BENCH_COMMAND=cargo bench --all-features --bench aggregate_query_sql
BENCH_FILTER=
BENCH_BRANCH_NAME=check-cost-of-hashing
Results will be posted here when complete

alamb · 2025-12-16T14:19:33Z

show benchmark queue

alamb-ghbot · 2025-12-16T14:19:38Z

🤖 Hi @alamb, you asked to view the benchmark queue (#19346 (comment)).

Job	User	Benchmarks	Comment
`19346_3657820674.sh`	rluvaton	aggregate_query_sql	`https://github.com/apache/datafusion/pull/19346#issuecomment-3657820674`
`19346_3659197246.sh`	Dandandan	aggregate_query_sql	`https://github.com/apache/datafusion/pull/19346#issuecomment-3659197246`
`19344_3660516858.sh`	Dandandan	tpch	`https://github.com/apache/datafusion/pull/19344#issuecomment-3660516858`
`19239_3660734795.sh`	alamb	default	`https://github.com/apache/datafusion/pull/19239#issuecomment-3660734795`

rluvaton · 2025-12-16T17:20:47Z

show benchmark queue

alamb-ghbot · 2025-12-16T17:20:51Z

🤖 Hi @rluvaton, you asked to view the benchmark queue (#19346 (comment)).

Job	User	Benchmarks	Comment
`19346_3657820674.sh`	rluvaton	aggregate_query_sql	`https://github.com/apache/datafusion/pull/19346#issuecomment-3657820674`
`19346_3659197246.sh`	Dandandan	aggregate_query_sql	`https://github.com/apache/datafusion/pull/19346#issuecomment-3659197246`
`19344_3660516858.sh`	Dandandan	tpch	`https://github.com/apache/datafusion/pull/19344#issuecomment-3660516858`
`19239_3660734795.sh`	alamb	default	`https://github.com/apache/datafusion/pull/19239#issuecomment-3660734795`
`19344_3660885262.sh`	alamb	default	`https://github.com/apache/datafusion/pull/19344#issuecomment-3660885262`

alamb · 2025-12-16T17:41:22Z

I just checked the runner. Whatever this benchmark is, it is taking a long time to complete.

Here is the most recent output (note it says it is going to take 3379.1s, almost an hour!!!, to run a single benchmark):

  8 (8.00%) high mild

Benchmarking aggregate_query_group_by_wide_u64_and_string_without_aggregate_expressions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 3379.1s, or reduce sample count to 10.
Benchmarking aggregate_query_group_by_wide_u64_and_string_without_aggregate_expressions: Collecting 100 samples in estimated 3379.1 s (100 iterations)
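
For scale, the numbers in that log line can be unpacked with a little arithmetic. The sketch below only restates what the log reports (3379.1 s estimated for 100 iterations); the function name `seconds_per_iteration` is made up for illustration and is not part of Criterion or DataFusion.

```rust
/// Per-iteration time implied by an estimate of the form
/// "Collecting N samples in estimated T s (K iterations)".
/// Hypothetical helper, named here only for illustration.
fn seconds_per_iteration(estimated_total_secs: f64, iterations: u32) -> f64 {
    estimated_total_secs / f64::from(iterations)
}

fn main() {
    // Values copied from the log output above.
    let estimated_total_secs = 3379.1;
    let iterations = 100;

    let per_iteration = seconds_per_iteration(estimated_total_secs, iterations);
    let total_minutes = estimated_total_secs / 60.0;

    println!(
        "{:.1} s per iteration, {:.1} min for the whole benchmark",
        per_iteration, total_minutes
    );
}
```

So a single run of the query is estimated at roughly half a minute on this branch, which is what makes 100 samples add up to almost an hour.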

rluvaton · 2025-12-16T18:27:45Z

I think this PR is probably the cause.

rluvaton · 2025-12-16T18:28:05Z

Is it possible to cancel benchmark runs?

rluvaton · 2025-12-16T18:31:23Z

And is there a way to see the current job status to know for next time?

rluvaton · 2025-12-16T18:45:24Z

And are the run logs from main or from this branch?

alamb · 2025-12-16T18:55:19Z

> Is it possible to cancel benchmark runs?

I just did it manually. I don't have any automated way to do it yet.

> And is there a way to see the current job status to know for next time?

Not that I know of yet. That would also be a great feature 🤔

Dandandan · 2025-12-16T20:45:55Z

I think the change probably makes it super slow by creating a lot of hash "collisions" (by only looking at the first column)

rluvaton · 2025-12-17T05:09:25Z

I wanted to see, for the group by wide u64 and string benchmark, how much time skipping the string hashing would save us. The string hash is irrelevant there: in the wide u64 case all values are unique, so you don't need to hash the string column.

So yes, I was aware that I would create hash collisions for the rest of the benchmarks (I didn't know it would be that slow, though). What surprises me is that the group by wide benchmark was the one that took a long time. As I explained, it shouldn't, since it's the same as hashing by a single column.
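
To make the expectation concrete, here is a minimal, self-contained sketch of the idea. It is not DataFusion's hashing code: it uses the standard library's `DefaultHasher`, and the row type, helper names, and data generators are all assumptions made up for illustration. It counts how many distinct hash values a set of `(u64, String)` group keys produces when all columns are hashed versus only the first one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Toy group key: one u64 column and one string column (illustrative only).
type Row = (u64, String);

/// Hash every column of the key.
fn hash_all_columns(row: &Row) -> u64 {
    let mut hasher = DefaultHasher::new();
    row.0.hash(&mut hasher);
    row.1.hash(&mut hasher);
    hasher.finish()
}

/// Hash only the first column, mimicking the experiment in this PR.
fn hash_first_column(row: &Row) -> u64 {
    let mut hasher = DefaultHasher::new();
    row.0.hash(&mut hasher);
    hasher.finish()
}

/// Number of distinct hash values produced for `rows` by `hash_fn`.
fn distinct_hashes(rows: &[Row], hash_fn: fn(&Row) -> u64) -> usize {
    rows.iter().map(hash_fn).collect::<HashSet<u64>>().len()
}

/// "Wide" data: every u64 value is unique.
fn unique_first_column(n: u64) -> Vec<Row> {
    (0..n).map(|i| (i, format!("s{}", i))).collect()
}

/// "Narrow" data: the u64 column cycles through `cardinality` values,
/// while the string column stays unique per row.
fn repeated_first_column(n: u64, cardinality: u64) -> Vec<Row> {
    (0..n).map(|i| (i % cardinality, format!("s{}", i))).collect()
}

fn main() {
    let wide = unique_first_column(1_000);
    let narrow = repeated_first_column(1_000, 4);

    println!(
        "wide:   all columns = {}, first column only = {}",
        distinct_hashes(&wide, hash_all_columns),
        distinct_hashes(&wide, hash_first_column)
    );
    println!(
        "narrow: all columns = {}, first column only = {}",
        distinct_hashes(&narrow, hash_all_columns),
        distinct_hashes(&narrow, hash_first_column)
    );
}
```

In this toy model the wide case loses nothing by skipping the string column, while the narrow case collapses 1,000 distinct keys onto at most 4 hash values, so every lookup has to fall back on comparing full keys. The sketch says nothing about why the wide benchmark was slow on the real runner.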

rluvaton · 2025-12-29T13:18:39Z

We can check whether this is useful after:

Add to our benchmarks Time Series Benchmark Suite by TimeScale #19525

wip: check cost of hashing

3aa62b0

github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 15, 2025

rluvaton closed this Dec 29, 2025
