NXP Backend: Adding the loader for eIQ Neutron SDK converter module. #16326
Closed
jirioc wants to merge 1,243 commits into pytorch:main from nxp-upstream:nxf96559/feature/EIEX-657-implement-loading-mechanism-for-eiq-neutron-sdk-converter-module
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16326. Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit e5deb89 with merge base 0935300. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 6adb5b2 to 1d7b384
cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai Signed-off-by: Sebastian Larsson <[email protected]>
Pass valid test_data to run_method_and_compare_outputs in PassPipeline. Signed-off-by: Oscar Andersson <[email protected]>
Differential Revision: D86785126 Pull Request resolved: pytorch#15757
Seeing an error when importing the flat tensor deserializer:
```
(executorch) [[email protected] /data/users/lfq/executorch (aff5086)]$ python debug-lora.py
Traceback (most recent call last):
  File "/data/users/lfq/executorch/debug-lora.py", line 1, in <module>
    from executorch.extension.flat_tensor.serialize.serialize import _deserialize_to_flat_tensor
  File "/data/users/lfq/executorch/src/executorch/extension/flat_tensor/serialize/serialize.py", line 20, in <module>
    from executorch.exir._serialize._cord import Cord
  File "/data/users/lfq/executorch/src/executorch/exir/__init__.py", line 9, in <module>
    from executorch.exir.capture import (
  File "/data/users/lfq/executorch/src/executorch/exir/capture/__init__.py", line 9, in <module>
    from executorch.exir.capture._capture import (
  File "/data/users/lfq/executorch/src/executorch/exir/capture/_capture.py", line 17, in <module>
    from executorch.exir.program import ExirExportedProgram
  File "/data/users/lfq/executorch/src/executorch/exir/program/__init__.py", line 10, in <module>
    from executorch.exir.program._program import (
  File "/data/users/lfq/executorch/src/executorch/exir/program/_program.py", line 82, in <module>
    from executorch.extension.flat_tensor.serialize.serialize import FlatTensorSerializer
ImportError: cannot import name 'FlatTensorSerializer' from partially initialized module 'executorch.extension.flat_tensor.serialize.serialize' (most likely due to a circular import) (/data/users/lfq/executorch/src/executorch/extension/flat_tensor/serialize/serialize.py)
```
Previously, the import happened at module load time, causing the circular dependency. Now the import happens at runtime, and we do not hit the circular dependency.
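A minimal sketch of the lazy-import pattern described above, assuming a hypothetical helper function; only the import target mirrors the one named in the traceback:
```python
# Sketch: deferring the import to call time breaks the import cycle.
# The helper name below is an assumption, not the actual ExecuTorch code.

def _get_flat_tensor_serializer():
    # Importing here (at runtime) instead of at module load time lets
    # executorch.exir.program._program finish initializing before the
    # flat_tensor serializer module is touched.
    from executorch.extension.flat_tensor.serialize.serialize import (
        FlatTensorSerializer,
    )

    return FlatTensorSerializer
```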
Arm backend: Support quantized while_loop
- Add annotation logic.
- Extend cond handling in q-dq folding to while
- Extend InsertCondRescale pass to handle while.
----------------------------------------------------
Arm backend: Add initial while_loop support.
- Refactor CondSupported to also test while; move it to its own file
  and split it into one check for submodule nodes and one for ops.
- Add node visitor
- Add tests
-----------------------------------------------------
Arm backend: Initial quantization support for conditional
The standard prepare/convert_pt2 does not seem to support
quantization out of the box. Instead, a quantization call
is introduced in the TOSAQuantizer that does the necessary
steps to get correct quantization on submodules. A custom
Quantize step is needed in the ArmTester to make this work
in testing.
Additionally, getting correct quantization parameters needs
some delicate handling. The model is calibrated twice,
once for each code path. Because of this, the observers outside
the if/else submodules see different data than the observers
inside the submodules. Rescales need to be inserted to handle
this. To get a correctly traced graph at all times, we
1. Fold the outermost quant ops in the submodules at the same
time as the cond is folded. Add qparam meta to folded
nodes inside submodule.
2. Use this meta in the InsertCondRescale pass to
insert a tosa.RESCALE to handle the different qparams.
3. After this, the submodule's q-dq nodes can be folded normally.
Signed-off-by: Erik Lundell <[email protected]>
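The rescale in step 2 can be illustrated with a small sketch. This is generic quantization arithmetic, not the InsertCondRescale pass itself; the float detour and missing clamp to the dtype range are simplifications:
```python
import torch


def rescale(q: torch.Tensor, scale_in: float, zp_in: int,
            scale_out: float, zp_out: int) -> torch.Tensor:
    # Dequantize with the outer observer's parameters, then requantize with
    # the submodule's parameters. A real tosa.RESCALE works in fixed point;
    # float keeps this sketch short.
    real = (q.to(torch.float32) - zp_in) * scale_in
    return torch.round(real / scale_out + zp_out).to(q.dtype)
```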
* Add *Pass suffix for passes missing it * Add missing _pass suffixes to python files containing passes * Correct pass name: DecomposeLinearVectorNormPass to DecomposeLinalgVectorNormPass * Rename ConvertIntPowToMuls to DecomposeIntPowPass * Rename QuantizeOperatorArguments to QuantizeClampArgumentsPass Signed-off-by: Martin Lindström <[email protected]>
A lightweight UndefinedBehaviorSanitizer runtime tailored for the ExecuTorch bare-metal examples. The goal is to provide basic memory-safety diagnostics while keeping the runtime self-contained. Signed-off-by: [email protected]
Currently only supports the NHWC memory format. Adds per-tensor weight quantization as the default quantization. Generalizes the CortexMPassManager to work with any pass that takes exported_program as an init arg, to support using both XNNPackPasses and ArmPasses. Refactors the QuantizedLinearFusionPass into a more general CovertToCortexMPass for replacing ATen ops with corresponding Cortex-M ops. Signed-off-by: Adrian Lundell <[email protected]>
Quick fix after pytorch#15896 cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai Signed-off-by: Adrian Lundell <[email protected]>
ATen clone ops can end up in the graph from a few sources. Since the graph is functional, we don't actually need them, and they are slow. This PR runs the no-op clone removal pass for XNNPACK. In addition, I ran into an issue where the XNNPACK delegate doesn't currently handle inputs being forwarded directly to partition outputs: there has to be at least one operator. To solve this, I updated the removal pass to leave these clone ops in and added copy support in the XNN delegate to copy directly to the output. In the long run, I want to remove these no-ops higher up as part of to_edge, but this requires alignment and changes with a few more backends. See pytorch#15838. But resolving this for XNNPACK will mitigate the issue for CPU models, at least. Differential Revision: D87405074
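A sketch of a no-op clone removal over a torch.fx graph, assuming a functional graph so clones carry no semantics. This is illustrative only, not the actual ExecuTorch pass:
```python
import torch


def remove_noop_clones(graph_module: torch.fx.GraphModule) -> torch.fx.GraphModule:
    graph = graph_module.graph
    for node in list(graph.nodes):
        if node.op == "call_function" and node.target is torch.ops.aten.clone.default:
            # Leave clones that feed a graph output directly, so the delegate
            # still has at least one operator between input and output.
            if any(user.op == "output" for user in node.users):
                continue
            node.replace_all_uses_with(node.args[0])
            graph.erase_node(node)
    graph.eliminate_dead_code()
    graph_module.recompile()
    return graph_module
```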
pytorch#15881) As title, addressed issues exposed from pytorch/pytorch#168098
```
FAILED exir/backend/test/test_debug_handle_map.py::TestBackendDebugHandle::test_lowered_the_whole_model - UnboundLocalError: local variable 'qnn_compile_spec_buffer' referenced before assignment
Falsifying example: test_lowered_the_whole_model(
    unlift=False,
    self=<test_debug_handle_map.TestBackendDebugHandle testMethod=test_lowered_the_whole_model>,
)
```
deserialize_pte_binary now returns an object with .program as a field, instead of returning the program directly.
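A hedged sketch of the call-site change this implies; the import path and variable names are assumptions, only the .program field comes from the description above:
```python
# Sketch only: the import path below is an assumption about where
# deserialize_pte_binary lives, not a verified API location.
from executorch.exir._serialize._program import deserialize_pte_binary

with open("model.pte", "rb") as f:
    pte_data = f.read()

# Before (sketch): program = deserialize_pte_binary(pte_data)
# After: the result wraps the program, which is exposed as a field.
result = deserialize_pte_binary(pte_data)
program = result.program
```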
Update TosaQuantizer to use get_module_name_filter from torchao. Adds new tests to validate that set_module_name works as intended. Fixes pytorch#15870 cc @freddan80 @per @zingo @digantdesai Signed-off-by: Oscar Andersson <[email protected]>
…ytorch#15887) cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai Signed-off-by: Sebastian Larsson <[email protected]> Co-authored-by: Zingo Andersen <[email protected]>
Move support_extension to TosaSpecification base class to avoid having to check whether the TosaSpecification is an instance of TosaSpecification_1_00. cc @freddan80 @per @zingo @digantdesai Signed-off-by: Oscar Andersson <[email protected]>
By default, outputs are re-ordered to the correct order during TOSA lowering. However, this is seen as a workaround, as it should not be needed. Furthermore, the output issue is not easily reproduced; rather, it seems to happen randomly. Therefore we add a test case without the workaround, which is currently passing. If at some point it stops passing without the workaround, the new changes might give some hints on why the workaround is needed and how to fix it. If it continues to pass, we may switch the default and potentially even remove the workaround. cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai --------- Co-authored-by: Zingo Andersen <[email protected]>
Differential Revision: D87252895 Pull Request resolved: pytorch#15885
Summary: # Summary This diff consolidates the backend functionality into a single target `//executorch/backends/aoti:aoti_backend` and simplifies the cuda backend target by making it dependent on the consolidated backend target. The following changes are made in this diff: * Creation of a new target `//executorch/backends/aoti:aoti_backend` in `fbcode/executorch/backends/aoti/targets.bzl` which includes the necessary dependencies for the AOTI backend. * Update of the `//executorch/backends/cuda:cuda_backend` target in `fbcode/executorch/backends/cuda/TARGETS` to depend on the new `//executorch/backends/aoti:aoti_backend` target instead of individual AOTI backend dependencies. * Creation of a new file `fbcode/executorch/backends/aoti/aoti_backend.py` which imports the necessary dependencies and passes for the AOTI backend. * Simplification of the `xplat/executorch/backends/cuda/cuda_backend.py` file by removing unnecessary imports and using the new `AotiBackend` class from the `aoti_backend.py` file. ghstack-source-id: 319556735 Reviewed By: larryliu0820 Differential Revision: D85704977 --------- Co-authored-by: Copilot <[email protected]>
Summary: Introduce a CUDA benchmark CI for monitoring CUDA backend performance. The CI will run in three situations: 1. it will run all possible models (voxtral, gemma and whisper) combined with all possible quantization schemas every day at 1am PST; 2. it will run a random model every time a PR gets merged; 3. it can be triggered manually by a user. Differential Revision: D87400561
The executor runner supports models both with and without bundled IO in the same path. To enable bundled IO, EXECUTORCH_BUILD_DEVTOOLS and EXECUTORCH_ENABLE_BUNDLE_IO are required. Adds tests in the Arm backend that exercise and depend on this. Besides enabling bundled IO for the VGF backend where applicable, some additional ResNet model tests are enabled as well. Avoids narrowing-conversion errors in the pte_to_header script by switching char to unsigned char. Signed-off-by: Måns Nilsson <[email protected]> Co-authored-by: Jacob Szwejbka <[email protected]>
Summary: Suspect the failure https://github.com/pytorch/pytorch/actions/runs/19547462483/job/55989739476 is due to using a different QnnBackend implementation. Rename this demo backend to a demo backend name. Differential Revision: D87586567
### Summary LoraLinears contain: 1. base weight (nn.Linear) 2. lora_a (nn.Linear) 3. lora_b (nn.Linear) (2) and (3) are caught by the filter, but (1) is not, as the weight and bias are pulled out of the nn.Linear and placed into nn.Parameters, and the linear is performed manually. This is for checkpoint compatibility - otherwise we'd have to map the weights for any lora model. See: https://github.com/pytorch/executorch/blob/b4d72f1e271915e9c0e1d313753a1eec840fbdee/examples/models/llama/lora.py#L31-L37 This PR adds lora linears into the quantization filter. ### Test plan ``` python -m extension.llm.export.export_llm \ base.checkpoint="${DOWNLOADED_PATH}/consolidated.00.pth" \ base.params="${DOWNLOADED_PATH}/params.json" \ base.adapter_checkpoint="../et_docs_7_epoch/adapter_model.safetensors" \ base.adapter_config="../et_docs_7_epoch/adapter_config.json" \ base.tokenizer_path="../et_docs_7_epoch/" \ model.use_kv_cache=true \ model.use_sdpa_with_kv_cache=true \ ``` Confirm output model size is ~1.7GB instead of 5.1GB. ``` (executorch) [[email protected] /data/users/lfq/executorch (lfq.quantize-lora-linears)]$ ls -la *.pte -rw-r--r-- 1 lfq users 5106135168 Nov 20 15:59 et_lora.pte -rw-r--r-- 1 lfq users 1733835776 Nov 20 17:07 et_lora_fix.pte ```
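A rough sketch of the kind of filter change described; the filter signature follows torchao's filter_fn convention of (module, fully-qualified name), and the LoraLinear class name and matching rule are assumptions, not the actual diff:
```python
import torch.nn as nn


def quantization_filter(module: nn.Module, fqn: str) -> bool:
    # Plain nn.Linear modules (covers lora_a / lora_b) were already caught.
    if isinstance(module, nn.Linear):
        return True
    # The base weight of a LoRA linear is held as nn.Parameters and the
    # matmul is done manually, so match the containing module as well.
    # The class name check here is an assumption for illustration.
    return type(module).__name__ == "LoraLinear"
```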
Decompose aten.clamp.Tensor to a chain of minimum/maximum operations, since TOSA clamp only supports scalar min/max bounds. This change:
- Rename DecomposeInt32ClampPass to DecomposeTOSAUnsupportedClampPass
- Extend the pass to handle aten.clamp.Tensor variants already annotated by ArmQuantizer
- Mark aten.clamp.Tensor as supported in the TOSA operator support lists
- Fix clamp decomposition to apply max with the lower bound, then min with the upper bound (y_i = min(max(x_i, min_i), max_i)), matching torch.clamp semantics even when min > max
- Align the memory layout of lifted tensor constants in process_inputs_to_lifted_tensor_constants with the TOSA memory format
Signed-off-by: Yufeng Shi <[email protected]> Co-authored-by: Ryan O'Shea <[email protected]> Co-authored-by: Oscar Andersson <[email protected]>
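A minimal sketch of the decomposition the commit describes, written against plain torch ops; this is not the actual pass, only the arithmetic it targets:
```python
import torch


def decomposed_clamp_tensor(
    x: torch.Tensor,
    min_t: torch.Tensor | None,
    max_t: torch.Tensor | None,
) -> torch.Tensor:
    # Apply max with the lower bound first, then min with the upper bound:
    # y_i = min(max(x_i, min_i), max_i), matching torch.clamp semantics
    # even when min > max.
    if min_t is not None:
        x = torch.maximum(x, min_t)
    if max_t is not None:
        x = torch.minimum(x, max_t)
    return x
```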
- Support additional_args for while by normalizing them to carried_args in a pass. - Support not using all while outputs. Signed-off-by: Erik Lundell <[email protected]>
Add method to verify dtypes of TOSA-operators in ArmTester. Signed-off-by: Oscar Andersson <[email protected]>
Adds new checks to quantization annotation to not fuse conv+relu patterns when output_qspec is symmetric. Symmetric quantization would force the zero-point to be 0, and in order for us to fuse relu the zero-point must equal qmin. Also adds new tests to verify that it works. Signed-off-by: Oscar Andersson <[email protected]>
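A sketch of the fusion condition described above; output_qspec is assumed to expose a qscheme attribute, and this is not the ArmQuantizer code:
```python
import torch


def can_fuse_conv_relu(output_qspec) -> bool:
    # Fusing ReLU into a quantized conv output requires the zero-point to
    # equal qmin, so that clamping at the zero-point implements the ReLU.
    # Symmetric quantization forces the zero-point to 0, which for a signed
    # dtype is not qmin, so the fusion must be skipped in that case.
    symmetric = (torch.per_tensor_symmetric, torch.per_channel_symmetric)
    return getattr(output_qspec, "qscheme", None) not in symmetric
```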
Differential Revision: D89308070 Pull Request resolved: pytorch#16283
…er hit validation Differential Revision: D89322661 Pull Request resolved: pytorch#16299
Need docker builds to pass before we advance viable/strict
Differential Revision: D89093678 Pull Request resolved: pytorch#16301
Force-pushed from 1d7b384 to e5deb89
Labels
- CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
- module: nxp (Issues related to NXP Neutron NPU delegation and code under backends/nxp/)
- release notes: nxp (Changes to the NXP Neutron backend delegate)
Summary
Adding the loader for eIQ Neutron SDK converter module.
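The PR description does not include code; a minimal sketch of what a loader for an external converter module can look like is shown below, assuming the converter ships as a separately installed Python package. The module name eiq_neutron_converter and the method shape are placeholders, not the actual NXP implementation; only the NeutronConverterManager name is suggested by the test file referenced in the test plan.
```python
# Illustrative sketch of a dynamic loader for an optional converter module.
import importlib
from types import ModuleType


class NeutronConverterManager:
    _MODULE_NAME = "eiq_neutron_converter"  # assumed package name

    def __init__(self) -> None:
        self._converter: ModuleType | None = None

    def load(self) -> ModuleType:
        # Import lazily so ExecuTorch works without the eIQ Neutron SDK
        # installed, and fail with a clear message when it is required.
        if self._converter is None:
            try:
                self._converter = importlib.import_module(self._MODULE_NAME)
            except ImportError as e:
                raise RuntimeError(
                    f"Converter module '{self._MODULE_NAME}' is not installed; "
                    "install the eIQ Neutron SDK to enable NXP delegation."
                ) from e
        return self._converter
```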
Test plan
This was tested by running test_neutron_converter_manager.py
cc @robert-kalmar @JakeStevens @digantdesai