Senior Backend Engineer for Smart PDF Extraction and Next.js Migration
Upwork

Remote
• 3 hours ago
• No applications yet
About
We are building a web application that allows users to upload documents and chat with their content. The core challenge is accurate document content extraction at scale, with OCR used only when strictly necessary, and with precise bounding boxes to enable high-quality text highlights inside a PDF viewer.

This is not a basic OCR task. The focus is precision, performance, low operational cost, and backend robustness.

We are looking for a senior-level engineer who understands document processing pipelines, OCR optimization, and production-ready backend systems, and who can also lead and execute the migration of the application to Next.js to improve SEO, performance, and rendering strategy.
________________________________________
Scope of Work (Milestone-Based)

Milestone 0 – Technical Audit
Duration: 1–2 days
Deliverables:
• Review of current backend architecture
• Review of current frontend architecture (React + Firebase)
• Identification of technical, performance, SEO, and cost risks
• Proposed OCR, security, and rendering architecture
• Clear and prioritized implementation and migration plan
________________________________________
Milestone 1 – Smart Document Extraction
Duration: 3–12 days
The system must handle PDFs and other document formats, including:
• .pdf
• .doc
• .docx
• .ppt
• .pptx
• .odt
• .odp
• .txt
• .rtf
• .md
• .html
• .htm
• .jpg
• .jpeg
• .png
Document & page-level detection strategy:
• If the document is 100% selectable text, no OCR must be applied
  • Native text must be extracted directly
  • Bounding boxes must be generated from embedded text when possible
• If the document is mixed (text plus scanned or image-based pages), OCR must be applied only to pages without selectable text
  • Pages that already contain text must never go through OCR
  • Page-level processing state must be persisted
• If the document is 100% scanned or image-based, full Tesseract OCR must be avoided
  • A low-cost vision AI must be used to generate usable textual descriptions per page
  • Output should focus on titles, sections, tables, and key fields
  • Example models: Gemini Flash / Flash-Lite or equivalent
________________________________________
Milestone 2 – Text Highlights (Bounding Boxes)
Duration: 3–4 days
Implement precise highlights.
Key concepts:
• A bounding box is the exact rectangle enclosing a word or text fragment in page coordinates, not screen coordinates
• Highlights are visual overlays, not text selections
Workflow:
• Backend returns text and bounding boxes
• Frontend renders the PDF
• Frontend draws semi-transparent rectangles using bounding box coordinates
• Highlights must remain accurate under zoom and responsive layouts
________________________________________
Milestone 3 – Critical Bugs (P0)
Duration: 3–5 days
• Authentication and login stability
• PDF upload flow
• Backend crashes
• Firestore security rules
• Storage rules
• App Check configuration
________________________________________
Milestone 4 – Backend Protection
Duration: 2–3 days
• Rate limiting per user
• File size validation
• Page count validation
• Clear logging for debugging and monitoring
• Abuse prevention mechanisms
Without proper limits, a single user could upload thousands of documents and trigger massive OCR costs.
________________________________________
Milestone 5 – Migration to Next.js
Duration: 4–8 days
• Migrate the existing React application to Next.js
• Define the correct rendering strategy (SSR, SSG, ISR, or hybrid)
• Improve SEO, metadata handling, and indexing
• Optimize Core Web Vitals (LCP, INP, CLS)
• Ensure compatibility with existing backend, auth, and storage
• Avoid breaking existing functionality during migration
________________________________________
Milestone 6 – Stability & Performance
Duration: 2–4 days
• Function optimization
• Reduced cold starts
• Improved error handling
• Overall backend reliability
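
Illustrative sketch for Milestone 1 (page-level detection strategy). The routing rules could look roughly like the Python below. The posting does not name a backend language, and `PageInfo`, `Route`, and `plan_routes` are hypothetical names introduced only for illustration. The `has_selectable_text` flag is assumed to come from an earlier PDF inspection step that is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    NATIVE_TEXT = "native_text"  # extract embedded text and boxes, no OCR
    OCR = "ocr"                  # scanned page inside a mixed document
    VISION_AI = "vision_ai"      # page of a fully scanned document


@dataclass
class PageInfo:
    number: int
    has_selectable_text: bool  # assumption: set by a prior inspection step


def plan_routes(pages: list[PageInfo]) -> dict[int, Route]:
    """Pick one processing route per page, following the Milestone 1 rules."""
    # A document counts as fully scanned when no page has selectable text.
    fully_scanned = not any(p.has_selectable_text for p in pages)
    routes: dict[int, Route] = {}
    for p in pages:
        if p.has_selectable_text:
            # Pages that already contain text never go through OCR.
            routes[p.number] = Route.NATIVE_TEXT
        elif fully_scanned:
            # 100% scanned document: avoid full OCR, use the vision model.
            routes[p.number] = Route.VISION_AI
        else:
            # Mixed document: OCR only the pages without selectable text.
            routes[p.number] = Route.OCR
    return routes
```

Persisting the returned mapping per document would be one way to meet the "page-level processing state must be persisted" requirement, since a retry could then skip pages that are already done.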
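
Illustrative sketch for Milestone 2 (page coordinates versus screen coordinates). The arithmetic below shows why storing boxes in page space keeps highlights accurate under zoom: only the scale factor changes at render time. It assumes PDF user space with the origin at the bottom-left and units in points, a viewer with the origin at the top-left, and an unrotated page with no crop-box offset. `Box` and `page_to_screen` are hypothetical names, and in the actual application this mapping would run in the frontend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float  # left
    y0: float  # bottom in page space, top in screen space
    x1: float  # right
    y1: float  # top in page space, bottom in screen space


def page_to_screen(box: Box, page_height: float, scale: float) -> Box:
    """Map a page-space box (origin bottom-left) to viewer pixels (origin top-left)."""
    return Box(
        x0=box.x0 * scale,
        y0=(page_height - box.y1) * scale,  # flip the vertical axis
        x1=box.x1 * scale,
        y1=(page_height - box.y0) * scale,
    )


# A word box on a US Letter page (792 pt tall), rendered at 2x zoom.
word = Box(x0=72, y0=700, x1=144, y1=712)
overlay = page_to_screen(word, page_height=792, scale=2.0)
```

Because the stored box never changes, re-rendering at a different zoom level only requires calling the mapping again with the new `scale`.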
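
Illustrative sketch for Milestone 4 (rate limiting, file size validation, page count validation). The checks could be combined into a single guard that runs before any extraction work starts. All limit values and the `UploadGuard` name are hypothetical. The sliding-window counter is kept in memory purely for illustration, so a serverless deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque

# Hypothetical limits; real values would come from the technical audit.
MAX_FILE_BYTES = 20 * 1024 * 1024
MAX_PAGES = 200
MAX_UPLOADS_PER_WINDOW = 10
WINDOW_SECONDS = 3600


class UploadGuard:
    """Rejects oversized uploads and users who exceed a sliding-window quota."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock                  # injectable for testing
        self._uploads = defaultdict(deque)   # user_id -> accepted upload times

    def check(self, user_id: str, file_bytes: int, page_count: int):
        """Return None if the upload is allowed, else a rejection reason."""
        if file_bytes > MAX_FILE_BYTES:
            return "file_too_large"
        if page_count > MAX_PAGES:
            return "too_many_pages"
        now = self._clock()
        recent = self._uploads[user_id]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= WINDOW_SECONDS:
            recent.popleft()
        if len(recent) >= MAX_UPLOADS_PER_WINDOW:
            return "rate_limited"
        recent.append(now)  # only accepted uploads count toward the quota
        return None
```

Returning a reason string rather than a boolean makes it straightforward to log each rejection, which fits the "clear logging for debugging and monitoring" item in the same milestone.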




