About the job
Hugging Face is on a mission to make exceptional AI accessible to everyone. Our rapidly expanding platform serves over 5 million users and 100,000 organizations, who have collectively shared more than 1 million models, 300,000 datasets, and 300,000 applications. Our open-source libraries have garnered over 400,000 stars on GitHub, reflecting our commitment to innovation and collaboration.
About the Role
As our inaugural Data & Infrastructure Advocate Engineer, you will play a pivotal role in connecting advanced data infrastructure with our global community of data engineers, researchers, and developers. Your mission will be to advocate for Xet storage on the Hugging Face Hub, empowering users to efficiently store, version, and collaborate on extensive datasets. This position is ideal for people who excel at the intersection of technical expertise (storage, Parquet, deduplication) and community engagement, and who want to contribute to the evolution of open data workflows.
In this role, you will work closely with teams such as Datasets, Hub, and Infrastructure to enhance the developer experience with data on our platform, inspiring a community to build superior, scalable data pipelines.
Your Key Responsibilities:
- Foster the open-source data/infrastructure community: initiate programs, collaborate with data-centric groups, and organize events or challenges. Engage with communities like Apache Parquet, Open Table Formats, and data engineering forums to advocate for best practices and Hugging Face tools.
- Establish the Hugging Face Hub as the premier platform for data storage, versioning, and collaboration by curating and presenting datasets, benchmarks, and tools such as Xet.
- Demonstrate practical use cases like efficient large dataset updates, Parquet editing, and deduplication to showcase the Hub’s utility for data workflows.
- Develop demos, benchmarks, and tools (e.g., Colab notebooks) to showcase best practices for data storage and versioning. Experiment with Xet, Parquet, and other data formats to highlight their potential for machine learning and data engineering.
- Create high-quality tutorials, blog posts, and videos that simplify complex subjects.
- Share insights on storage optimization, dataset versioning, and deduplication to empower developers.
- Engage actively in online communities (Discord, GitHub, forums) to highlight contributions, respond to inquiries, and promote collaboration.
- Ensure datasets and tools released on the Hub are thoroughly documented, showcasing clear examples, benchmarks, and use cases.
About You
This role is a great match if you:
- Possess strong technical skills in Python, data libraries (e.g., pandas, pyarrow, huggingface/datasets), and storage systems (Parquet, Open Table Formats, S3).
- Are a hands-on innovator who enjoys experimenting with data tools, optimizing storage, and versioning datasets.
- Can communicate complex technical concepts clearly and effectively.

