feat: offline OCR (Tesseract) + embedding category classifier (@xenova/transformers)
Tesseract OCR (PHP, server-side):
- Dockerfile: adds tesseract-ocr + tesseract-ocr-ita + libgd-dev (gd extension)
- api/index.php: new tesseractReadExpiry() decodes the base64 image, pre-processes it with GD (2× upscale, greyscale, auto-contrast, sharpen), runs the tesseract CLI with languages ita+eng in PSM 6, extracts the date with multi-pattern regexes (DD/MM/YYYY, MM/YYYY, ISO, named month), and returns YYYY-MM-DD plus a confidence value
- geminiReadExpiry() now: (1) tries Tesseract first; (2) falls back to Gemini Vision if OCR returns null or finds no date; (3) reports the source ('ocr'|'gemini') in the response
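
The date extraction and the OCR-first fallback described above can be sketched roughly as follows. This is an illustration in JavaScript, not the PHP code from api/index.php: the names extractExpiryDate and readExpiry, the month table, the exact regexes, and the choice to map MM/YYYY to the last day of the month are all assumptions.

```javascript
// Hypothetical sketch; the commit's real implementation is PHP and may differ.
// Three-letter month prefixes, Italian and English.
const MONTHS = {
  gen: 1, jan: 1, feb: 2, mar: 3, apr: 4, mag: 5, may: 5, giu: 6, jun: 6,
  lug: 7, jul: 7, ago: 8, aug: 8, set: 9, sep: 9, ott: 10, oct: 10,
  nov: 11, dic: 12, dec: 12,
};

const pad = (n) => String(n).padStart(2, '0');

// Last calendar day of a month (month is 1-based).
const lastDay = (year, month) => new Date(Date.UTC(year, month, 0)).getUTCDate();

// Build YYYY-MM-DD, or null when the components are not a real date.
function toIso(year, month, day) {
  if (month < 1 || month > 12 || day < 1 || day > lastDay(year, month)) return null;
  return `${year}-${pad(month)}-${pad(day)}`;
}

// Patterns are tried in order, most specific first.
const PATTERNS = [
  // ISO: 2025-03-31
  [/\b(\d{4})-(\d{2})-(\d{2})\b/, (m) => toIso(+m[1], +m[2], +m[3])],
  // DD/MM/YYYY, also with '.' or '-' as separator
  [/\b(\d{1,2})[\/.\-](\d{1,2})[\/.\-](\d{4})\b/, (m) => toIso(+m[3], +m[2], +m[1])],
  // Named month: 12 GEN 2026, 3 March 2025
  [/\b(\d{1,2})\s+([a-z]{3})[a-z]*\.?\s+(\d{4})\b/i, (m) => {
    const month = MONTHS[m[2].toLowerCase()];
    return month ? toIso(+m[3], month, +m[1]) : null;
  }],
  // MM/YYYY: assumption, treat the product as expiring at the end of that month
  [/\b(\d{1,2})[\/.\-](\d{4})\b/, (m) => {
    const month = +m[1];
    return month >= 1 && month <= 12 ? toIso(+m[2], month, lastDay(+m[2], month)) : null;
  }],
];

// Return the first valid date found in the OCR text, or null.
function extractExpiryDate(text) {
  for (const [regex, build] of PATTERNS) {
    const match = regex.exec(text);
    if (match) {
      const iso = build(match);
      if (iso) return iso;
    }
  }
  return null;
}

// Try offline OCR first; call the fallback reader only when no date was found.
// `ocr` and `fallback` are hypothetical async functions supplied by the caller.
async function readExpiry(image, ocr, fallback) {
  const text = await ocr(image);
  const date = text ? extractExpiryDate(text) : null;
  if (date) return { date, source: 'ocr' };
  return { date: await fallback(image), source: 'gemini' };
}
```

Trying the full-date patterns before MM/YYYY keeps a string such as 15/08/2026 from being read as a month-only date.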
@xenova/transformers embedding classifier (browser-side):
- index.html: ES-module bootstrap that lazy-loads 'Xenova/all-MiniLM-L6-v2' quantized (~23 MB, cached in browser) via window._getCategoryPipeline(); pre-warms on first scan page visit
- assets/js/app.js: classifyCategoryByEmbedding(name) embeds the product name and 16 category anchor descriptions, compares them by cosine similarity, and accepts the best match only if it clears a 0.30 threshold; results are cached in the _embeddingCache Map
- autoDetectCategory(): after keyword map misses, fires classifyCategoryByEmbedding async and updates select when resolved (respects manuallySet flag)
- createQuickProduct(): if the regex returned 'altro' (the "other" category), silently patches the category with the embedding result via a background API call
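
As a rough illustration of the classifier step, the sketch below picks the anchor with the highest cosine similarity and rejects it below the 0.30 threshold, caching the outcome per product name. It is not the code from assets/js/app.js: classifyByEmbedding, the embed callback, and the shape of the anchors array are hypothetical, and whether the real comparison is inclusive at exactly 0.30 is an assumption.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors match nothing
}

const THRESHOLD = 0.30;   // value taken from the commit message
const cache = new Map();  // normalised product name -> category or null

// `embed` (async: text -> vector) and `anchors` ([{ category, vector }]) are
// hypothetical stand-ins for the real pipeline and the 16 anchor descriptions.
async function classifyByEmbedding(name, embed, anchors) {
  const key = name.trim().toLowerCase();
  if (cache.has(key)) return cache.get(key);

  const vector = await embed(key);
  let best = { category: null, score: -Infinity };
  for (const anchor of anchors) {
    const score = cosineSimilarity(vector, anchor.vector);
    if (score > best.score) best = { category: anchor.category, score };
  }

  // Below the threshold the caller keeps its existing category.
  const result = best.score >= THRESHOLD ? best.category : null;
  cache.set(key, result);
  return result;
}
```

Caching null results as well means a name that matched no category does not trigger a second embedding call.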
+6 -2
@@ -1,11 +1,15 @@
 FROM php:8.2-apache
 
-# Install required PHP extensions
+# Install required PHP extensions + Tesseract OCR for offline expiry date reading
 RUN apt-get update && apt-get install -y \
 libsqlite3-dev \
 libcurl4-openssl-dev \
 libonig-dev \
-&& docker-php-ext-install pdo_sqlite curl mbstring \
+libgd-dev \
+tesseract-ocr \
+tesseract-ocr-ita \
+tesseract-ocr-eng \
+&& docker-php-ext-install pdo_sqlite curl mbstring gd \
 && apt-get clean && rm -rf /var/lib/apt/lists/*
 
 # Enable Apache mod_rewrite and mod_headers