CodeCache: Accept multiple threads in SaveData#5500
Conversation
This allows mapping the code directly into memory for execution.
…ation This ensures that any code buffer memory owned by the MappedResource is invalidated before being deallocated.
A significant proportion of time with the prior approach was spent in the unbuffered write calls used once per code page/entrypoint; taking upwards of 15 seconds to serialize a 700MB P3R cache. Switching to an mmap approach reduces the CPU time there to less than a second, leaving I/O to dominate. A two-pass approach is used: 1. Calculate the required size for the cache file, issue a callback to frontend to map a cache file of that size. 2. Write cache data into the buffer provided by that callback.
Merges and sorts relocations from all provided threads before serializing, enabling multi-threaded offline code cache generation.
|
Rebase necessary. |
neobrain
left a comment
There was a problem hiding this comment.
Could you PR an actual implementation of multithreaded writing instead so that there's a meaningful base for discussion? The interface change to SaveData on its own isn't very useful.
Regarding mmap writing, I would prefer this to stay in a PR for now. What we have today is easier to iterate on, and I've got upcoming implementation changes in SaveData that would interfere unnecessarily. If serialization time is critical, simple buffered writing would work as a stop-gap.
I think it should be fairly easy to iterate on, like I explicitly did it in a generic way that is trivial to extend without new address logic? I think unless the code quality is meaningfully bad it would be best to merge instead of asking to reinvent the wheel again, delaying it again just means more conflict and more effort for whoever ends up needing to clean it up. And like it makes things 10x faster or smth so may as well take the win. Could you PR an actual implementation of multithreaded writing instead so that there's a meaningful base for discussion? The interface change to SaveData on its own isn't very useful. |
I'm only measuring a 2x improvement (both for small and large caches), which can equally be achieved with simple buffering that doesn't require us to awkwardly split IO logic or to do a double-pass. (My guess is your 10x measurement was made before I addressed the trivial low-hanging fruit in SaveData?)
How is it dependent? Surely this can already be illustrated Linux-side with everything being upstream there already? |
|
How is it dependent? Surely this can already be illustrated Linux-side with everything being upstream there already? I have not implemented this for the Linux side and do not have time to. If you want me to pr windows side I can. I'm only measuring a 2x improvement (both for small and large caches), which can equally be achieved with simple buffering that doesn't require us to awkwardly split IO logic or to do a double-pass. (My guess is your 10x measurement was made before I addressed the trivial low-hanging fruit in SaveData?) I really don't think it's that awkward to do two passes, 2x improvement is pretty good and meaningful. Especially in the future mmap is much more flexible that just chained writes, and FEX is pretty well acquainted with mmap anyway :) |
Needs some consideration of if this is the cleanest way, but this is certainly the simplest.