About the Role
We are on the lookout for an exceptional Lead Data Engineer to join our innovative tech team at Linkup. You will have the opportunity to collaborate closely with our CTO, Denis Charrier, and contribute to transformative projects that drive our mission forward.
In this pivotal role, your primary responsibilities will include:
Designing and implementing large-scale, high-throughput data pipelines that fuel AI-driven web search capabilities.
Architecting and managing distributed ETL/ELT workflows for web indexing, embeddings, and analytics.
Creating and refining data ingestion and transformation systems for diverse structured and unstructured data sources.
Overseeing and enhancing data storage solutions (SQL, NoSQL, and object stores) to ensure peak performance and reliability.
Engaging in architecture discussions and contributing to the design of our data infrastructure.
Collaborating with machine learning and backend teams to deliver real-time, high-quality data for effective ranking and retrieval.
Writing clean, maintainable code and actively participating in code reviews to uphold software quality.
Qualifications
You will be an ideal candidate if you:
Are passionate about shaping the future of AI and resonate with Linkup’s vision.
Possess a minimum of three years of solid experience in data engineering or backend development.
Have a strong grasp of distributed systems, data modeling, and streaming architectures.
Value data quality, observability, and system reliability in your work.
Enjoy tackling complex pipeline challenges and optimizing for latency or throughput.
Are inquisitive about AI infrastructure, indexing systems, and retrieval-augmented generation (RAG).
Thrive in a dynamic, fast-paced startup environment where ownership and initiative are encouraged.
Example Projects
Stream and transform billions of web facts daily, adhering to strict freshness SLAs (< 5 min).
Develop a real-time data lake to support embeddings, metadata, and ranking features for LLMs.
Design and build fault-tolerant pipelines for continuous crawling, deduplication, and data enrichment.
Create internal tools to monitor and analyze data flows across terabytes of web-scale content.
