feat: offline OCR (Tesseract) + embedding category classifier (@xenova/transformers)

Tesseract OCR (PHP, server-side):
- Dockerfile: adds tesseract-ocr + tesseract-ocr-ita + libgd-dev (gd extension)
- api/index.php: new tesseractReadExpiry() — decodes base64 image, pre-processes with GD (2× upscale, greyscale, auto-contrast, sharpen), runs tesseract CLI with ita+eng PSM-6, extracts date with multi-pattern regex (DD/MM/YYYY, MM/YYYY, ISO, named-month), returns YYYY-MM-DD + confidence
- geminiReadExpiry() now: (1) tries Tesseract first; (2) falls back to Gemini Vision if OCR returns null or no date found; (3) passes source ('ocr'|'gemini') in response

@xenova/transformers embedding classifier (browser-side):
- index.html: ES-module bootstrap that lazy-loads 'Xenova/all-MiniLM-L6-v2' quantized (~23 MB, cached in browser) via window._getCategoryPipeline(); pre-warms on first scan page visit
- assets/js/app.js: classifyCategoryByEmbedding(name) — embeds product name + 16 category anchor descriptions, cosine similarity, threshold 0.30; results cached in _embeddingCache Map
- autoDetectCategory(): after keyword map misses, fires classifyCategoryByEmbedding async and updates select when resolved (respects manuallySet flag)
- createQuickProduct(): if regex returned 'altro', silently patches category with embedding result via a background api call
This commit is contained in:
dadaloop82
2026-05-03 13:17:14 +00:00
parent c814d99d1f
commit a6c2fb93cf
4 changed files with 363 additions and 13 deletions
+6 -2
View File
@@ -1,11 +1,15 @@
FROM php:8.2-apache
# Install required PHP extensions
# Install required PHP extensions + Tesseract OCR for offline expiry date reading
RUN apt-get update && apt-get install -y \
libsqlite3-dev \
libcurl4-openssl-dev \
libonig-dev \
&& docker-php-ext-install pdo_sqlite curl mbstring \
libgd-dev \
tesseract-ocr \
tesseract-ocr-ita \
tesseract-ocr-eng \
&& docker-php-ext-install pdo_sqlite curl mbstring gd \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Enable Apache mod_rewrite and mod_headers