About the job
At Hugging Face, we're on a mission to democratize outstanding AI solutions. Our platform is rapidly becoming the premier destination for AI developers, boasting over 5 million users and 100,000 organizations that have shared more than 1 million models, 300,000 datasets, and 300,000 applications. Our open-source libraries have garnered over 400,000 stars on GitHub.
About the Role
As the first Data Infrastructure Advocate Engineer at Hugging Face, you will play a crucial role in connecting innovative data infrastructure with a vibrant community of data engineers, researchers, and developers. You will advocate for Xet storage on the Hugging Face Hub, enabling users to efficiently store, version, and collaborate on large datasets. This position is ideal for someone who excels at the intersection of technical expertise (storage, Parquet, deduplication) and community engagement, and who wants to help shape the future of open data workflows.
In this role, you will collaborate with various teams such as Datasets, Hub, and Infrastructure to enhance how developers interact with data on our platform, inspiring a community to create better, faster, and more scalable data pipelines.
Your Key Responsibilities:
- Build and support the open-source data and infrastructure community by launching initiatives, collaborating with data-focused groups, and organizing events or challenges. Engage with communities such as Apache Parquet, Open Table Formats, and data engineering forums to promote best practices and Hugging Face tools.
- Position the Hugging Face Hub as the leading platform for data storage, versioning, and collaboration by curating and showcasing datasets, benchmarks, and tools like Xet.
- Demonstrate the Hub's value for data workflows by highlighting use cases such as efficient large dataset updates, Parquet editing, and deduplication.
- Develop demos, benchmarks, and tools (e.g., Colab notebooks) to showcase best practices for data storage and versioning. Experiment with Xet, Parquet, and other data formats to reveal their potential in machine learning and data engineering.
- Create informative tutorials, blog posts, and videos that simplify complex topics.
- Share valuable insights on storage optimization, dataset versioning, and deduplication to empower developers.
- Engage actively in online communities (Discord, GitHub, forums) to showcase contributions, answer queries, and encourage collaboration.
- Ensure comprehensive documentation for datasets and tools released on the Hub, including clear examples, benchmarks, and use cases.
About You
You are an ideal candidate if you:
- Possess strong technical skills in Python, data libraries (e.g., pandas, pyarrow, huggingface/datasets), and storage systems (Parquet, Open Table Formats, S3).
- Are a hands-on developer who enjoys experimenting with data tools, storage optimization, and dataset versioning.
- Can clearly articulate complex concepts to varied audiences.

