- Add .txt and .md extensions to SUPPORTED_FORMATS mapping
- Add _convert_txt_to_markdown method for plain text files
- Support docling's native MD InputFormat for markdown files
- Add proper format detection and routing logic
- Preserve existing PDF OCR detection and multi-format support
Co-Authored-By: PromptEngineer <jnfarooq@outlook.com>
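A minimal sketch of the extension mapping and routing the converter commits describe. Only the names SUPPORTED_FORMATS, DocumentConverter, and _convert_txt_to_markdown come from the commits; the dictionary values, the routing logic, and _convert_with_docling are assumptions, not the project's actual code.

```python
from pathlib import Path

# Hypothetical extension-to-format mapping; the real mapping's values
# (and whether it maps to strings at all) are assumptions.
SUPPORTED_FORMATS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".txt": "txt",
    ".md": "md",
}


class DocumentConverter:
    def convert(self, path: str) -> str:
        ext = Path(path).suffix.lower()
        fmt = SUPPORTED_FORMATS.get(ext)
        if fmt is None:
            raise ValueError(f"Unsupported file extension: {ext}")
        if fmt == "txt":
            return self._convert_txt_to_markdown(path)
        if fmt == "md":
            # Markdown can pass through as-is (docling also accepts it
            # natively via its MD InputFormat).
            return Path(path).read_text(encoding="utf-8")
        return self._convert_with_docling(path)

    def _convert_txt_to_markdown(self, path: str) -> str:
        # Plain text needs no structural conversion beyond reading it in.
        return Path(path).read_text(encoding="utf-8")

    def _convert_with_docling(self, path: str) -> str:
        raise NotImplementedError  # placeholder for the docling-backed path
```

The key property is that every supported extension resolves through one table, so adding a format means adding one mapping entry plus one handler.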
- Rename PDFConverter to DocumentConverter with multi-format support
- Add SUPPORTED_FORMATS mapping for PDF, DOCX, HTML, HTM extensions
- Update indexing pipeline to use DocumentConverter
- Update file validation across all frontend components and scripts
- Preserve existing PDF OCR detection logic
- Add format-specific conversion methods for different document types
Co-Authored-By: PromptEngineer <jnfarooq@outlook.com>
- Add NaN and infinite value detection in QwenEmbedder and OllamaEmbedder
- Implement LanceDB table creation with on_bad_vectors='drop' parameter
- Add fallback strategy with on_bad_vectors='fill' and fill_value=0.0
- Add pre-filtering of chunks with invalid embeddings before indexing
- Add NaN validation to LateChunkEncoder
- Add detailed logging for skipped chunks and error handling
- Resolves LanceDB error: 'Vector column has NaNs' during indexing
This fix ensures robust handling of edge cases in embedding generation
and prevents indexing failures due to invalid vector values.
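The pre-filtering step can be sketched as follows. The function name, chunk representation, and logging style are assumptions for illustration; only the NaN/infinite-value check itself is what the commit describes.

```python
import math


def filter_valid_chunks(chunks, embeddings):
    """Drop chunks whose embedding contains NaN or infinite values,
    logging each skip so failures are visible rather than silent."""
    valid_chunks, valid_embeddings = [], []
    for chunk, emb in zip(chunks, embeddings):
        if any(not math.isfinite(v) for v in emb):
            # In the real pipeline this would go through a logger.
            print(f"Skipping chunk with invalid embedding: {chunk[:40]!r}")
            continue
        valid_chunks.append(chunk)
        valid_embeddings.append(emb)
    return valid_chunks, valid_embeddings
```

On the LanceDB side, `on_bad_vectors="drop"` (with `on_bad_vectors="fill", fill_value=0.0` as the fallback) acts as a second line of defense at table creation, so vectors that slip past the pre-filter still cannot abort indexing.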
- Add environment auto-detection in ChatDatabase class
- Support both local development and Docker container paths
- Local development: uses 'backend/chat_data.db' (relative path)
- Docker containers: uses '/app/backend/chat_data.db' (absolute path)
- Maintain backward compatibility with explicit path overrides
- Update RAG API server to use auto-detection
This resolves the SQLite database connection error that occurred
when running LocalGPT in local development environments while
maintaining compatibility with Docker deployments.
Fixes: Database path hardcoded to Docker container path
Tested: Local development and Docker environment detection
Breaking: none (maintains backward compatibility)
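One way the auto-detection above could look. The `/.dockerenv` marker check is an assumption about how the environment is detected, and the function name is hypothetical; the two paths and the explicit-override precedence come from the commit.

```python
import os


def resolve_chat_db_path(explicit_path=None):
    """Pick the chat database path based on the runtime environment."""
    if explicit_path:
        return explicit_path               # explicit override always wins
    if os.path.exists("/.dockerenv"):      # common marker file inside Docker
        return "/app/backend/chat_data.db"  # absolute container path
    return "backend/chat_data.db"          # relative path for local dev
```

Checking the override first is what preserves backward compatibility: callers that already pass a path see no behavior change.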
- Update MarkdownRecursiveChunker to use tokenizer for token-based sizing
- Update DoclingChunker to use tokenizer with proper error handling
- Ensure IndexingPipeline passes tokenizer_model to both chunkers
- Update UI tooltips to reflect that both modes now use tokens
- Keep Docling as default for enhanced granularity features
- Add fallback to character-based approximation when tokenizer fails
Co-Authored-By: PromptEngineer <jnfarooq@outlook.com>
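The token-counting-with-fallback behavior described above can be sketched like this. The function name and the 4-characters-per-token heuristic are assumptions; the commit specifies only that a failing tokenizer falls back to a character-based approximation.

```python
def count_tokens(text, tokenizer=None):
    """Count tokens via the tokenizer, falling back to a character-based
    approximation if no tokenizer is available or encoding fails."""
    if tokenizer is not None:
        try:
            return len(tokenizer.encode(text))
        except Exception:
            pass  # fall through to the approximation below
    # Rough heuristic: ~4 characters per token for English-like text.
    return max(1, len(text) // 4)
```

Both chunkers measuring size through one function like this is what makes a "512" setting mean 512 tokens in either mode, instead of 512 characters in one of them.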
- Change default chunker_mode from 'legacy' to 'docling' for token-based chunking
- Update UI to reflect the new default, with DoclingChunker enabled
- Improve tooltips to clarify token vs character chunking behavior
- Fixes issue where the 512-token setting used character-based chunking
Co-Authored-By: PromptEngineer <jnfarooq@outlook.com>
- Replace existing localGPT codebase with multimodal RAG implementation
- Include full-stack application with backend, frontend, and RAG system
- Add Docker support and comprehensive documentation
- Enhance with multimodal capabilities for document processing
- Preserve git history for localGPT while integrating new functionality