#328 - Remove annotations in a specified range by dmnc-grdnr · Pull Request #329 · dkpro/dkpro-cassis

dmnc-grdnr · 2025-08-10T11:23:27Z

This PR adds the following convenience functions to the cas object:

cut_sofa_string_to_range -> replaces the sofa string with a cutout of the original as specified by the index range and removes all annotations in the cas that are outside of that specified range
remove_in_range -> removes all annoations (of one specified type) inside of the specified index range of the sofa string

The methods can be called on any cas object directly.

Tests for each function was added to test_cas.py

- added functions for removing annoations from cas - cut_sofa_string_to_range - remove_annotations_in_range - added tests

reckart · 2025-08-10T11:28:34Z

Thanks for the PR.

"Normally" (i.e. in particular in Java UIMA), the sofa is considered to be immutable once set.

Have you considered generating a new CAS that contains only the cut-out section instead of doing an in-place modification?

dmnc-grdnr · 2025-08-10T18:18:59Z

I didn't expect such a quick response, thanks :)
Maybe I have overlooked something, but how can you only "partially" read a CAS file or only the section of interest?
Because I don't generate the CAS files I use myself, I am looking for a way to cutout a specific part of it without having to manually modify the original CAS file.

Copilot

Pull Request Overview

This PR adds two convenience methods to the Cas class for managing annotations based on their position in the sofa string: cut_sofa_string_to_range and remove_in_range.

Key changes:

cut_sofa_string_to_range: Cuts the sofa string to a specified range and removes/adjusts annotations accordingly
remove_in_range: Removes annotations within a specified range, optionally filtered by type
Comprehensive test coverage for both methods with different scenarios

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File	Description
cassis/cas.py	Adds two new methods: `cut_sofa_string_to_range` for cutting sofa string with annotation adjustment, and `remove_in_range` for removing annotations in a range
tests/test_cas.py	Adds four test functions covering different scenarios for both new methods

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

cassis/cas.py

Copilot · 2025-11-10T19:45:43Z

cassis/cas.py

+                    annotation.end = annotation.end - sofa_begin
+                elif overlap and sofa_begin < annotation.end <= sofa_end:
+                    annotation.begin = 0
+                    annotation.end = sofa_end - sofa_begin


Incorrect calculation for overlapping annotation end position. When an annotation starts before sofa_begin but ends within the range, the end should be annotation.end - sofa_begin, not sofa_end - sofa_begin. The current code incorrectly extends the annotation to the full cutout range.

Suggested change

annotation.end = sofa_end - sofa_begin

annotation.end = annotation.end - sofa_begin

I agree with this suggestion.

Copilot · 2025-11-10T19:45:43Z

cassis/cas.py

+        annotations = self.select_all() if type_ is None else self.select(type_)
+        if 0 <= cut_begin < cut_end <= len(self.sofa_string):
+            for annotation in annotations:
+                if cut_begin <= annotation.begin < annotation.end <= cut_end:
+                    self.remove(annotation)


Iterating and removing from collection during iteration can cause issues. When using self.select_all(), the result is a list, so this works. However, converting to a list explicitly (e.g., list(annotations)) before iteration would be safer and more explicit to prevent potential issues if the return type changes in the future.

Personally, I believe the code should fail if the return types of cas.select_all and cas.select change, because it signifies a major change in functionality. But I am on the fence on this one.

I agree with the Copilot. Just because select_all() returns a List, it does not mean that the List may not be a live list. Removing stuff from a live list would be problematic. When removing stuff from a container such as the CAS, it is a good practice to make a snapshot of the element list over which to iterate before modifying the container.

Copilot · 2025-11-10T19:45:44Z

cassis/cas.py

+            for annotation in self.select_all():
+                if sofa_begin <= annotation.begin and annotation.end <= sofa_end:
+                    annotation.begin = annotation.begin - sofa_begin
+                    annotation.end = annotation.end - sofa_begin
+                elif overlap and sofa_begin < annotation.end <= sofa_end:
+                    annotation.begin = 0
+                    annotation.end = sofa_end - sofa_begin
+                elif overlap and sofa_begin <= annotation.begin < sofa_end:
+                    annotation.begin = annotation.begin - sofa_begin
+                    annotation.end = len(self.sofa_string)
+                elif overlap and annotation.begin <= sofa_begin and sofa_end <= annotation.end:
+                    annotation.begin = 0
+                    annotation.end = len(self.sofa_string)
+                else:
+                    self.remove(annotation)


Iterating over select_all() and removing items during iteration can cause issues. The list returned by select_all() is modified during iteration when self.remove(annotation) is called. Consider iterating over a copy: for annotation in list(self.select_all()):

select_all returns an independent list of annotations. It should be safe to directly remove an annotation from the CAS while iterating over the list.

The interface only promises to return a List, it does not promise that this list is an independent object. See comment above.

Copilot · 2025-11-10T19:45:44Z

tests/test_cas.py

+
+    assert len(cas.select_all()) == len(expected_leftover_annotations)
+
+    print(cas.sofa_string)


Debug print statement left in test code. This should be removed or replaced with an assertion that verifies the expected sofa_string value.

Suggested change

print(cas.sofa_string)

Copilot · 2025-11-10T19:45:44Z

tests/test_cas.py

+
+    expected_leftover_annotations = [annotation for annotation in cas.select_all()
+                                     if (begin <= annotation.begin < end)
+                                        or (annotation.begin < begin < end <= annotation.end)]


The expected leftover annotations logic doesn't match the implementation. The condition doesn't account for all overlap cases. For example, annotations where annotation.end overlaps but annotation.begin < begin and annotation.end > end should be included but aren't. The condition should be: if (begin <= annotation.begin < end) or (begin < annotation.end <= end) or (annotation.begin < begin and annotation.end > end)

Suggested change

or (annotation.begin < begin < end <= annotation.end)]

or (begin < annotation.end <= end)

or (annotation.begin < begin and annotation.end > end)]

I agree with this suggestion.

dmnc-grdnr · 2025-11-15T09:51:36Z

I agree with most suggestions, with the exception of the issue iteration/remove (384-398). But I am, again, unsure how to proceed from here on :)

reckart · 2025-11-28T21:05:41Z

Thanks again :) Same as with the other PR:

make changes
make regular commit (no amend, no force push)

This change is a tricky one because you can have annotations that are transitively referenced through a feature of an indexed feature structure, not that themselves are not on the index. That means if you use only select_all(), you would not find them. You would have to use _find_all_fs like in the other PR to actually find those. And then the question is: if you have an annotation that is transitively referenced that would end up being outside of the document, should it be removed entirely and unlinked from its referent - or should it remain but its position be collapsed to the start or end of the sofa? Or possibly something else entirely?

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-27T06:00:32Z

tests/test_cas.py

+def test_cut_sofa_string_to_range_no_overlap(small_typesystem_xml, small_xmi):
+    typesystem = load_typesystem(small_typesystem_xml)
+    cas = load_cas_from_xmi(small_xmi, typesystem)
+
+    begin = 10
+    end = 20
+
+    expected_leftover_annotations = [annotation for annotation in cas.select_all()
+                                     if begin <= annotation.begin < annotation.end <= end]
+
+    cas.cut_sofa_string_to_range(begin, end, overlap=False)
+
+    assert len(cas.select_all()) == len(expected_leftover_annotations)
+


The test doesn't verify that the sofa string is actually cut correctly or that annotation offsets are properly adjusted. Consider adding assertions to check that cas.sofa_string == original_sofa_string[begin:end] and that specific annotations have the expected begin/end offsets after the cut operation.

Copilot · 2026-01-27T06:00:32Z

cassis/cas.py

+    def remove_in_range(self, cut_begin:int, cut_end:int, type_: Union[Type, str]=None):
+        """Removes annotations between two indices of the sofa string.
+
+        Args:
+            cut_begin: The beginning of the cutting interval.
+            cut_end: The end of the cutting interval.
+            type_: The type or name of the type name whose annotation instances are to be found
+        Raises:
+            ValueError: If range indices are invalid or annotation type_ not found.
+        """
+
+        annotations = self.select_all() if type_ is None else self.select(type_)
+        if 0 <= cut_begin < cut_end <= len(self.sofa_string):
+            for annotation in annotations:
+                if cut_begin <= annotation.begin < annotation.end <= cut_end:
+                    self.remove(annotation)
+        else:
+            raise ValueError(f"Invalid indices for begin {cut_begin} and end {cut_end}")


For consistency with other methods in this file (like _get_feature_structures_in_range at line 541), consider using plain begin and end parameter names instead of cut_begin and cut_end. The cut_ prefix is somewhat redundant given the method name already indicates the cutting operation.

Suggested change

def remove_in_range(self, cut_begin:int, cut_end:int, type_: Union[Type, str]=None):

"""Removes annotations between two indices of the sofa string.

Args:

cut_begin: The beginning of the cutting interval.

cut_end: The end of the cutting interval.

type_: The type or name of the type name whose annotation instances are to be found

Raises:

ValueError: If range indices are invalid or annotation type_ not found.

"""

annotations = self.select_all() if type_ is None else self.select(type_)

if 0 <= cut_begin < cut_end <= len(self.sofa_string):

for annotation in annotations:

if cut_begin <= annotation.begin < annotation.end <= cut_end:

self.remove(annotation)

else:

raise ValueError(f"Invalid indices for begin {cut_begin} and end {cut_end}")

def remove_in_range(self, begin: int, end: int, type_: Union[Type, str] = None):

"""Removes annotations between two indices of the sofa string.

Args:

begin: The beginning of the cutting interval.

end: The end of the cutting interval.

type_: The type or name of the type name whose annotation instances are to be found

Raises:

ValueError: If range indices are invalid or annotation type_ not found.

"""

annotations = self.select_all() if type_ is None else self.select(type_)

if 0 <= begin < end <= len(self.sofa_string):

for annotation in annotations:

if begin <= annotation.begin < annotation.end <= end:

self.remove(annotation)

else:

raise ValueError(f"Invalid indices for begin {begin} and end {end}")

Copilot · 2026-01-27T06:00:32Z

cassis/cas.py

        """
        self.add_all(annotations)

+    def cut_sofa_string_to_range(self, sofa_begin:int, sofa_end:int, overlap=True):


The parameter type hints are missing spaces after the colons. According to PEP 8, there should be a space after the colon in type hints. The signature should be def cut_sofa_string_to_range(self, sofa_begin: int, sofa_end: int, overlap=True): (note the spaces after the colons).

Suggested change

def cut_sofa_string_to_range(self, sofa_begin:int, sofa_end:int, overlap=True):

def cut_sofa_string_to_range(self, sofa_begin: int, sofa_end: int, overlap=True):

Copilot · 2026-01-27T06:00:33Z

cassis/cas.py

+            but keeps annotations that overlap with cutout points by default.
+
+                Args:
+                    sofa_begin: The beginning of the cutout sofa.
+                    sofa_end: The end of the cutout sofa.
+                    overlap: If true, keeps overlapping annotations and modifies begin and end of annotation accordingly.
+
+                Raises:
+                    ValueError: If cutout indices are invalid.
+                """


The docstring has inconsistent indentation. The Args and Raises sections should be indented to align with the opening quotes, not have extra indentation. This is inconsistent with the docstring format used elsewhere in the file (see the remove method at line 402 or add method at line 315).

Suggested change

but keeps annotations that overlap with cutout points by default.

Args:

sofa_begin: The beginning of the cutout sofa.

sofa_end: The end of the cutout sofa.

overlap: If true, keeps overlapping annotations and modifies begin and end of annotation accordingly.

Raises:

ValueError: If cutout indices are invalid.

"""

but keeps annotations that overlap with cutout points by default.

Args:

sofa_begin: The beginning of the cutout sofa.

sofa_end: The end of the cutout sofa.

overlap: If true, keeps overlapping annotations and modifies begin and end of annotation accordingly.

Raises:

ValueError: If cutout indices are invalid.

"""

Copilot · 2026-01-27T06:00:33Z

cassis/cas.py

+                    annotation.end = annotation.end - sofa_begin
+                elif overlap and sofa_begin < annotation.end <= sofa_end:
+                    annotation.begin = 0
+                    annotation.end = sofa_end - sofa_begin


The calculation of annotation.end is incorrect for annotations that start before the range and end inside it. When an annotation starts before sofa_begin and ends at some position within the range, the new end position should be annotation.end - sofa_begin, not sofa_end - sofa_begin. For example, if an annotation spans [0, 15] and we cut the sofa string to [10, 20], the annotation should become [0, 5] in the new coordinate system, not [0, 10].

Suggested change

annotation.end = sofa_end - sofa_begin

annotation.end = annotation.end - sofa_begin

Copilot · 2026-01-27T06:00:33Z

tests/test_cas.py

+    end = 20
+
+    expected_leftover_annotations = [annotation for annotation in cas.select_all()
+                                     if (begin <= annotation.begin < end)


The test logic for determining expected leftover annotations is incomplete. It's missing annotations that start before the range and end inside it. The condition should also include (annotation.begin < begin and begin < annotation.end <= end) to account for annotations that overlap on the left side. With the current test data, this might not be caught, but it would fail with different test data.

Suggested change

if (begin <= annotation.begin < end)

if (begin <= annotation.begin < end)

or (annotation.begin < begin and begin < annotation.end <= end)

Copilot · 2026-01-27T06:00:33Z

cassis/cas.py

+            cut_end: The end of the cutting interval.
+            type_: The type or name of the type name whose annotation instances are to be found
+        Raises:
+            ValueError: If range indices are invalid or annotation type_ not found.


The Raises documentation mentions "annotation type_ not found" but this error is not actually handled by this function - it would propagate as a TypeNotFoundError from the underlying select() method. Following the convention used in similar methods like select(), select_covered(), etc., the Raises section should only document the ValueError for invalid indices and remove the mention of type not found.

Suggested change

ValueError: If range indices are invalid or annotation type_ not found.

ValueError: If range indices are invalid.

Copilot · 2026-01-27T06:00:34Z

tests/test_cas.py

+def test_cut_sofa_string_to_range(small_typesystem_xml, small_xmi):
+    typesystem = load_typesystem(small_typesystem_xml)
+    cas = load_cas_from_xmi(small_xmi, typesystem)
+
+    begin = 10
+    end = 20
+
+    expected_leftover_annotations = [annotation for annotation in cas.select_all()
+                                     if (begin <= annotation.begin < end)
+                                        or (annotation.begin < begin < end <= annotation.end)]
+
+    cas.cut_sofa_string_to_range(begin, end)
+
+    assert len(cas.select_all()) == len(expected_leftover_annotations)
+


The test doesn't verify that the sofa string is actually cut correctly or that annotation offsets are properly adjusted. Consider adding assertions to check that cas.sofa_string == original_sofa_string[begin:end] and that specific annotations have the expected begin/end offsets after the cut operation.

Copilot · 2026-01-27T06:00:34Z

tests/test_cas.py

+def test_cut_sofa_string_to_range(small_typesystem_xml, small_xmi):
+    typesystem = load_typesystem(small_typesystem_xml)
+    cas = load_cas_from_xmi(small_xmi, typesystem)
+
+    begin = 10
+    end = 20
+
+    expected_leftover_annotations = [annotation for annotation in cas.select_all()
+                                     if (begin <= annotation.begin < end)
+                                        or (annotation.begin < begin < end <= annotation.end)]
+
+    cas.cut_sofa_string_to_range(begin, end)
+
+    assert len(cas.select_all()) == len(expected_leftover_annotations)
+


The test data doesn't include any annotations that start before the cut range and end inside it (left-overlap case), so the bug in line 390 is not caught by this test. Consider adding test data that exercises this scenario, for example, by using a range like [5, 15] which would catch token [4, 10] in a left-overlap situation.

Copilot · 2026-01-27T06:00:34Z

tests/test_cas.py

+
+    print(cas.sofa_string)


There's a leftover print statement in the test that should be removed before merging.

Suggested change

print(cas.sofa_string)

reckart · 2026-01-27T06:06:32Z

@zesch Do you have an opinion on this?

if you have an annotation that is transitively referenced that would end up being outside of the document, should it be removed entirely and unlinked from its referent - or should it remain but its position be collapsed to the start or end of the sofa? Or possibly something else entirely?

zesch · 2026-01-27T09:47:01Z

@zesch Do you have an opinion on this?

if you have an annotation that is transitively referenced that would end up being outside of the document, should it be removed entirely and unlinked from its referent - or should it remain but its position be collapsed to the start or end of the sofa? Or possibly something else entirely?

I would vote for removing it entirely.

dkpro#328 - Remove annotations in a specified range

61abb88

- added functions for removing annoations from cas - cut_sofa_string_to_range - remove_annotations_in_range - added tests

reckart assigned dmnc-grdnr Nov 10, 2025

reckart added this to the 0.11.0 milestone Nov 10, 2025

reckart added the 🆕Enhancement label Nov 10, 2025

Merge branch 'main' into feature/328-remove-annotations

9b5005f

reckart requested a review from Copilot November 10, 2025 19:43

Copilot AI reviewed Nov 10, 2025

View reviewed changes

Merge branch 'dkpro:main' into feature/328-remove-annotations

b721505

reckart requested a review from Copilot January 27, 2026 05:53

Copilot started reviewing on behalf of reckart January 27, 2026 05:53 View session

Copilot AI reviewed Jan 27, 2026

View reviewed changes

	annotation.end = sofa_end - sofa_begin
	annotation.end = annotation.end - sofa_begin


		assert len(cas.select_all()) == len(expected_leftover_annotations)

		print(cas.sofa_string)

	or (annotation.begin < begin < end <= annotation.end)]
	or (begin < annotation.end <= end)
	or (annotation.begin < begin and annotation.end > end)]

	def cut_sofa_string_to_range(self, sofa_begin:int, sofa_end:int, overlap=True):
	def cut_sofa_string_to_range(self, sofa_begin: int, sofa_end: int, overlap=True):

	if (begin <= annotation.begin < end)
	if (begin <= annotation.begin < end)
	or (annotation.begin < begin and begin < annotation.end <= end)

	ValueError: If range indices are invalid or annotation type_ not found.
	ValueError: If range indices are invalid.

Conversation

dmnc-grdnr commented Aug 10, 2025

Uh oh!

reckart commented Aug 10, 2025

Uh oh!

dmnc-grdnr commented Aug 10, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

dmnc-grdnr Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

dmnc-grdnr Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

reckart Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

dmnc-grdnr Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

reckart Nov 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

dmnc-grdnr Nov 15, 2025

Choose a reason for hiding this comment

Uh oh!

dmnc-grdnr commented Nov 15, 2025

Uh oh!

reckart commented Nov 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

reckart commented Nov 28, 2025 •

edited

Loading