9 Commits

Author SHA1 Message Date
Devin AI
583c72e340 feat: Add TXT and MD file format support to DocumentConverter
- Add .txt and .md extensions to SUPPORTED_FORMATS mapping
- Add _convert_txt_to_markdown method for plain text files
- Support docling's native MD InputFormat for markdown files
- Add proper format detection and routing logic
- Preserve existing PDF OCR detection and multi-format support

Co-Authored-By: PromptEngineer <jnfarooq@outlook.com>
2025-07-21 20:47:24 +00:00
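
The extension-based routing described in this commit could look roughly like the sketch below. It is an illustration only: the string format labels, `detect_format`, and the body of `convert_txt_to_markdown` are assumptions, and the actual `SUPPORTED_FORMATS` mapping may point at docling format values rather than strings.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
# The real SUPPORTED_FORMATS may map to docling format values instead.
SUPPORTED_FORMATS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".md": "md",
    ".txt": "txt",
}


def detect_format(path):
    """Return the format label for a path, or None if the extension is unsupported."""
    return SUPPORTED_FORMATS.get(Path(path).suffix.lower())


def convert_txt_to_markdown(text):
    """Treat plain text as Markdown: normalize line endings, end with one newline."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.strip() + "\n"


if __name__ == "__main__":
    for name in ["Report.PDF", "notes.txt", "page.htm", "archive.zip"]:
        print(name, "->", detect_format(name))
```

Lower-casing the suffix before the lookup keeps files such as `Report.PDF` from being rejected by validation.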
Devin AI
d5929ce29b feat: Add support for DOCX and HTML file formats using docling
- Rename PDFConverter to DocumentConverter with multi-format support
- Add SUPPORTED_FORMATS mapping for PDF, DOCX, HTML, HTM extensions
- Update indexing pipeline to use DocumentConverter
- Update file validation across all frontend components and scripts
- Preserve existing PDF OCR detection logic
- Add format-specific conversion methods for different document types

Co-Authored-By: PromptEngineer <jnfarooq@outlook.com>
2025-07-21 20:40:39 +00:00
PromptEngineer
acf6efb5a4 fix: Add comprehensive NaN handling for LanceDB indexing
- Add NaN and infinite value detection in QwenEmbedder and OllamaEmbedder
- Implement LanceDB table creation with on_bad_vectors='drop' parameter
- Add fallback strategy with on_bad_vectors='fill' and fill_value=0.0
- Add pre-filtering of chunks with invalid embeddings before indexing
- Add NaN validation to LateChunkEncoder
- Add detailed logging for skipped chunks and error handling
- Resolves LanceDB error: 'Vector column has NaNs' during indexing

This fix ensures robust handling of edge cases in embedding generation
and prevents indexing failures due to invalid vector values.
2025-07-18 00:26:39 -07:00
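
The pre-filtering step named in this commit can be sketched as follows. This is a minimal illustration, not the project's code: `is_valid_vector` and `filter_valid_chunks` are hypothetical names, and embeddings are assumed to be plain lists of floats. The `on_bad_vectors` options mentioned in the commit message would act as a second line of defense at table creation and are not shown here.

```python
import math


def is_valid_vector(vec):
    """True if the vector is non-empty and contains no NaN or infinite values."""
    return len(vec) > 0 and all(math.isfinite(x) for x in vec)


def filter_valid_chunks(chunks, embeddings):
    """Split chunks into those with usable embeddings and the indices that were skipped."""
    kept_chunks, kept_vectors, skipped = [], [], []
    for index, (chunk, vec) in enumerate(zip(chunks, embeddings)):
        if is_valid_vector(vec):
            kept_chunks.append(chunk)
            kept_vectors.append(vec)
        else:
            skipped.append(index)  # caller can log these for diagnostics
    return kept_chunks, kept_vectors, skipped


if __name__ == "__main__":
    chunks = ["a", "b", "c"]
    embeddings = [[0.1, 0.2], [float("nan"), 1.0], [float("inf"), 0.0]]
    kept, vectors, skipped = filter_valid_chunks(chunks, embeddings)
    print(kept, skipped)
```

Returning the skipped indices rather than silently dropping them makes it possible to log which chunks never reached the index.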
PromptEngineer
35697b23a4 fix: implement automatic database path detection for multi-environment compatibility
- Add environment auto-detection in ChatDatabase class
- Support both local development and Docker container paths
- Local development: uses 'backend/chat_data.db' (relative path)
- Docker containers: uses '/app/backend/chat_data.db' (absolute path)
- Maintain backward compatibility with explicit path overrides
- Update RAG API server to use auto-detection

This resolves the SQLite database connection error that occurred
when running LocalGPT in local development environments while
maintaining compatibility with Docker deployments.

Fixes: Database path hardcoded to Docker container path
Tested: Local development and Docker environment detection
Breaking: No breaking changes - maintains backward compatibility
2025-07-17 22:13:25 -07:00
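
A path resolver along the lines this commit describes might look like the sketch below. The two paths come from the commit message; everything else is assumed for illustration, including the function name `resolve_db_path`, the `in_docker` parameter, and the use of `/.dockerenv` as the container check. The actual detection logic in `ChatDatabase` is not shown in the log and may differ.

```python
import os

LOCAL_DB_PATH = "backend/chat_data.db"        # relative path, local development
DOCKER_DB_PATH = "/app/backend/chat_data.db"  # absolute path, Docker container


def resolve_db_path(explicit_path=None, in_docker=None):
    """Pick a database path: explicit override first, then environment detection."""
    if explicit_path:
        # Explicit overrides keep existing callers working unchanged.
        return explicit_path
    if in_docker is None:
        # Assumed heuristic: Docker creates /.dockerenv inside containers.
        in_docker = os.path.exists("/.dockerenv")
    return DOCKER_DB_PATH if in_docker else LOCAL_DB_PATH


if __name__ == "__main__":
    print(resolve_db_path())
    print(resolve_db_path(explicit_path="/tmp/custom.db"))
```

Accepting `in_docker` as a parameter lets tests exercise both branches without depending on the machine they run on.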
Devin AI
a13a71d247 fix: make both chunking methods token-based
- Update MarkdownRecursiveChunker to use tokenizer for token-based sizing
- Update DoclingChunker to use tokenizer with proper error handling
- Ensure IndexingPipeline passes tokenizer_model to both chunkers
- Update UI tooltips to reflect that both modes now use tokens
- Keep Docling as default for enhanced granularity features
- Add fallback to character-based approximation when tokenizer fails

Co-Authored-By: PromptEngineer <jnfarooq@outlook.com>
2025-07-15 06:38:55 +00:00
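
Token-based sizing with a character-based fallback, as this commit describes, can be sketched roughly as below. The sketch makes several assumptions: `tokenizer` is any callable returning a list of tokens, four characters per token is an arbitrary approximation, and `count_tokens` and `chunk_paragraphs` are hypothetical names rather than the methods of `MarkdownRecursiveChunker` or `DoclingChunker`.

```python
def count_tokens(text, tokenizer=None, chars_per_token=4):
    """Count tokens with the tokenizer, falling back to a character-based estimate."""
    if tokenizer is not None:
        try:
            return len(tokenizer(text))
        except Exception:
            pass  # tokenizer failed; use the approximation below
    if not text:
        return 0
    return max(1, len(text) // chars_per_token)


def chunk_paragraphs(paragraphs, max_tokens, tokenizer=None):
    """Greedily pack paragraphs into chunks that stay within max_tokens where possible."""
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n = count_tokens(para, tokenizer)
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        # A single paragraph larger than max_tokens still becomes its own chunk.
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    paragraphs = ["a b c", "d e", "f g h i"]
    print(chunk_paragraphs(paragraphs, max_tokens=5, tokenizer=str.split))
```

Here whitespace splitting stands in for a real tokenizer; swapping in a model tokenizer only requires passing a different callable.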
Devin AI
3b648520c9 fix: default to token-based chunking for accurate chunk sizing
- Change default chunker_mode from 'legacy' to 'docling' for token-based chunking
- Update UI to reflect new default with DoclingChunk enabled by default
- Improve tooltips to clarify token vs character chunking behavior
- Fixes issue where 512 token setting was using character-based chunking

Co-Authored-By: PromptEngineer <jnfarooq@outlook.com>
2025-07-15 06:05:22 +00:00
PromptEngineer
6d73a61e5c refactor: Remove unused imports across codebase
Removed unused import statements from various Python files to improve code clarity and reduce unnecessary dependencies.
2025-07-12 02:34:17 -07:00
PromptEngineer
c93b8639ab fix(db): Correct database path and chat history logic
2025-07-12 01:51:57 -07:00
PromptEngineer
2421514f3e Integrate multimodal RAG codebase
- Replaced existing localGPT codebase with multimodal RAG implementation
- Includes full-stack application with backend, frontend, and RAG system
- Added Docker support and comprehensive documentation
- Enhanced with multimodal capabilities for document processing
- Preserved git history for localGPT while integrating new functionality
2025-07-11 00:17:15 -07:00