
Discussion: Feature Request


This is a long paragraph about the feature request with enough words to pass the pruning threshold easily. The feature would add support for document extraction in the crawl pipeline, enabling binary documents like PDFs and DOCX files to be processed alongside HTML pages.

alice

I think this is a great idea. We should implement it using a pluggable strategy pattern so users can bring their own extraction backend. This would keep the core library lean while supporting many document types.
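
To make the idea concrete, here is a rough sketch of what a pluggable extraction strategy could look like. Every name in it (`DocumentExtractionStrategy`, `PlainTextStrategy`, `extract_document`) is made up for illustration and is not an existing crawl4ai API; the dispatch-by-file-suffix rule is also just an assumption.

```python
from abc import ABC, abstractmethod
from pathlib import PurePosixPath
from typing import Iterable, Optional
from urllib.parse import urlparse


class DocumentExtractionStrategy(ABC):
    """Hypothetical interface for user-supplied extraction backends."""

    @abstractmethod
    def supports(self, url: str) -> bool:
        """Return True if this backend can handle the document at `url`."""

    @abstractmethod
    def extract(self, data: bytes) -> str:
        """Turn the raw document bytes into plain text."""


class PlainTextStrategy(DocumentExtractionStrategy):
    """Toy backend: handles .txt files by decoding them as UTF-8."""

    def supports(self, url: str) -> bool:
        # Look only at the URL path so query strings do not affect the suffix.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        return suffix == ".txt"

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


def extract_document(
    url: str,
    data: bytes,
    strategies: Iterable[DocumentExtractionStrategy],
) -> Optional[str]:
    """Use the first strategy that claims the URL; return None if none does."""
    for strategy in strategies:
        if strategy.supports(url):
            return strategy.extract(data)
    return None
```

With this shape the core only needs the loop in `extract_document`; anything format-specific lives in a strategy that users register themselves.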

bob

Agreed with alice. The abstract base class approach makes sense. We could also add a built-in implementation for PDFs since crawl4ai already has PDFContentScrapingStrategy that could be wrapped.
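
Something like the following adapter is what I have in mind for the wrapping part. Treat it as a sketch only: `DocumentExtractionStrategy`, `WrappedPdfStrategy`, and `FakePdfScraper` are hypothetical names, and `FakePdfScraper.scrape_bytes` is a stand-in for whatever the existing PDF backend actually exposes, since I have not checked the real PDFContentScrapingStrategy interface.

```python
from abc import ABC, abstractmethod


class DocumentExtractionStrategy(ABC):
    """Hypothetical base class for extraction backends."""

    @abstractmethod
    def extract(self, data: bytes) -> str:
        """Turn the raw document bytes into plain text."""


class FakePdfScraper:
    """Stand-in for an existing PDF backend (assumed interface, not the real one)."""

    def scrape_bytes(self, data: bytes) -> dict:
        # A real backend would parse the PDF; this one just echoes the bytes.
        return {"text": data.decode("latin-1")}


class WrappedPdfStrategy(DocumentExtractionStrategy):
    """Adapter that exposes an existing PDF backend through the base class."""

    PDF_MAGIC = b"%PDF-"  # PDF files begin with this header

    def __init__(self, scraper) -> None:
        self._scraper = scraper

    def extract(self, data: bytes) -> str:
        # Reject input that is clearly not a PDF before delegating.
        if not data.startswith(self.PDF_MAGIC):
            raise ValueError("input does not look like a PDF")
        return self._scraper.scrape_bytes(data)["text"]


strategy = WrappedPdfStrategy(FakePdfScraper())
text = strategy.extract(b"%PDF-1.7 example body")
```

The adapter keeps the existing backend untouched, so the built-in PDF support would just be one more strategy next to user-supplied ones.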
