-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Enables DefaultListFilesCache by default #19366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
10b740d to
73c6672
Compare
- Sets the DefaultListFilesCache to be enabled by default - Adds additional object store access tests to show list caching behavior - Adds variable setting/reading sqllogic test cases - Updates tests to disable caching when they relied on COPY commands so changes can be detected for each query - Updates docs to help users upgrade
| You can configure the maximum cache size and cache entry expiration time via configuration options: | ||
|
|
||
| See <https://github.com/apache/datafusion/issues/19056> for more details. | ||
| `datafusion.runtime.list_files_cache_limit` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| `datafusion.runtime.list_files_cache_limit` | |
| * `datafusion.runtime.list_files_cache_limit` |
|
|
||
| See <https://github.com/apache/datafusion/issues/19056> for more details. | ||
| `datafusion.runtime.list_files_cache_limit` | ||
| `datafusion.runtime.list_files_cache_ttl` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| `datafusion.runtime.list_files_cache_ttl` | |
| * `datafusion.runtime.list_files_cache_ttl` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* made prettiercomplain so I had to go for - instead. In either case these are formatting better now, thanks!
| You can configure the maximum cache size and cache entry expiration time via configuration options: | ||
|
|
||
| See <https://github.com/apache/datafusion/issues/19056> for more details. | ||
| `datafusion.runtime.list_files_cache_limit` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those do not render well at the moment - https://github.com/BlakeOrth/datafusion/blob/73c667216ee2fcb85196694cc5a36633f9928c19/docs/source/library-user-guide/upgrading.md#listingtableprovider-now-caches-list-commands
Also it would be good to mention the units of their values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the technical "unit" and a quick explanation for both of these, but since the actual user input is a string that undergoes parsing (e.g. "1m30s") I also linked out to the user guide to direct users to the more detailed information.
| `datafusion.runtime.list_files_cache_limit` | ||
| `datafusion.runtime.list_files_cache_ttl` | ||
|
|
||
| Caching can be disable by setting the limit to 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Caching can be disable by setting the limit to 0: | |
| Caching can be disabled by setting the limit to 0: |
| )); | ||
| Some(lfc) | ||
| } | ||
| _ => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user has disabled caching with list_files_cache_limit = "0K" then None will be returned here, but in this case get_list_files_cache_limit() will return "1M"
766ccbe to
da08e25
Compare
- Fixes erroneous cache limit value being returned if the user has disabled the cache
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @BlakeOrth -- this looks great to me and does what we set out to do -- which is make a session local list files cache.
I tested it locally and you can see the second query doesn't make any object store requests
Here is what it looks like on this branch:
DataFusion CLI v51.0.0
> \object_store_profiling summary
ObjectStore Profile mode set to Summary
> CREATE EXTERNAL TABLE maps STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure/';
0 row(s) fetched.
Elapsed 6.369 seconds.
Object Store Profiling
Instrumented Object Store: instrument_mode: Summary, inner: AmazonS3(overturemaps-us-west-2)
Summaries:
+-----------+----------+-----------+-----------+------------+------------+-------+
| Operation | Metric | min | max | avg | sum | count |
+-----------+----------+-----------+-----------+------------+------------+-------+
| Get | duration | 0.104099s | 0.455794s | 0.237693s | 5.229248s | 22 |
| Get | size | 524288 B | 857230 B | 677976.1 B | 14915475 B | 22 |
| List | duration | 0.377573s | 0.377573s | 0.377573s | 0.377573s | 1 |
| List | size | | | | | 1 |
+-----------+----------+-----------+-----------+------------+------------+-------+
> drop table maps;
0 row(s) fetched.
Elapsed 0.002 seconds.
Object Store Profiling
> CREATE EXTERNAL TABLE maps STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure/';
0 row(s) fetched.
Elapsed 0.657 seconds.
Object Store Profiling
>(note there are no object store requests on the second call to CREATE EXTERNAL TABLE)
Whereas on main there is a second call
Object Store Profiling
> CREATE EXTERNAL TABLE maps STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure/';
0 row(s) fetched.
Elapsed 1.155 seconds.
Object Store Profiling
Instrumented Object Store: instrument_mode: Summary, inner: AmazonS3(overturemaps-us-west-2)
Summaries:
+-----------+----------+-----+-----+-----+-----+-------+
| Operation | Metric | min | max | avg | sum | count |
+-----------+----------+-----+-----+-----+-----+-------+
| List | duration | | | | | 3 |
| List | size | | | | | 3 |
+-----------+----------+-----+-----+-----+-----+-------+COncernsl
As I played around with this, it was somewhat akward. Maybe we want to adjust the behavior somehow --
Confusing aspects:
- I can't set a TTL to 0
Object Store Profiling
> SET datafusion.runtime.list_files_cache_ttl = '0s';
Error during planning: Duration must be greater than 0 seconds- Even when I set a ttl, the cache didn't reset
> SET datafusion.runtime.list_files_cache_ttl = "1s";
0 row(s) fetched.
Elapsed 0.002 seconds.
-- wait over a second
-- cache is still used (creates really fast with no requests)
> CREATE EXTERNAL TABLE maps STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure/';
0 row(s) fetched.
Elapsed 0.647 seconds.
Object Store ProfilingI wonder if it would make more sense to scope the cache to a particular table somehow (kind of like is done today, but still impose a global limit / visibility)
| those changes until the `ListingTableProvider` instance is dropped and recreated. | ||
|
|
||
| You will be able to configure the maximum cache size and cache expiration time via a configuration option: | ||
| You can configure the maximum cache size and cache entry expiration time via configuration options: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This was an intentional decision implemented in #19108 because setting a TTL of 0 conceptually doesn't make much sense to me because it would mean that all cache entries would immediately expire. Functionally it would allow the cache to consume memory and never hit, which also seems undesirable. That being said, the fact that it caused confusion probably means we can improve the user experience in some aspects. What would you expect a TTL of 0 to do?
This does seem odd. I think what you have shown here should be working as you expect and I'm concerned there might be a bug with timing/expiration that's working in unit tests but not in regular usage. I will devote some time to looking into this! |
I suggest we merge this PR as is to unblock the work to display the contents of the cache. We can file follow on tickets to refine the behavior afterwards. Is that acceptable to you @BlakeOrth ? |
@alamb Yep, that works for me! Would we want to run benchmarks against this PR since it's performance related? |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
I think these benchmark results are generally in line with my expectations for a benchmark running against local files. Some meaningful improvement on queries that already run quickly (single milliseconds range) and generally no change on anything where file access, which are already pretty fast, is an overall small proportion of the work that needs to be done. It looks like most of the reported slow downs are on the long-running queries. I can't really imagine this work having any meaningful effect on those queries; Considering file listing happens at the beginning of the query processing I'm inclined to say that may be run-to-run variation more than this work actually slowing things down. |
|
Yes -- thank you @BlakeOrth -- the more I have been thinking about this PR the more I am worried about the behavior of -- calls list to get th efiles
create external table foo...
drop table foo;
-- reuses the cached file list
create external table foo ...Specifically it seems like most people would expect that calling What do you think about keeping the caches table scoped? We can do it either in this PR or as a follow on? |
|
@alamb I think there's a pretty strong justification for doing something around table initialization as well. I'm not sure what solution you had in mind, but I think it would potentially be reasonable to treat a Somewhat selfishly I think I would prefer this in a follow-on PR. I'm currently in the middle of a high priority bug hunt and I'm trying to take some time off until the new year, so I personally won't have time to pursue a solution for a little over a week. It would be nice to keep things moving and I hope a follow-on issue might allow someone else to implement it while I'm recharging a bit! |
Yes absolutely -- makes sense. I will plan to file a follow on issue shortly and then merge this PR |
Which issue does this PR close?
ListFilesCacheimplementation for theListingTable#18827Rationale for this change
Now that the
DefaultListFilesCachecan be configured by users it's safe to enable it by default and fix the tests that caching broke!What changes are included in this PR?
Are these changes tested?
Yes, additional test cases have been added to help show the behavior of the caching
Are there any user-facing changes?
Yes, this changes the default behavior of DataFusion, however this information is already captured in the upgrade guide.
cc @alamb