Highlights
Transform messy data to gold. Build scalable ML pipelines. Impact real-time speech AI.
Description
Job Summary
Join Smallest.ai as a Research Data Engineer and revolutionize the way data is processed for cutting-edge speech, language, and real-time systems. You will transform messy, noisy data into high-quality datasets that power our models.
Responsibilities
- Build high-throughput pipelines for audio, text, and multimodal data
- Design heuristic and ML-based data filtering systems
- Clean, filter, deduplicate, and normalize multilingual data
- Create scalable evaluation datasets across languages and domains
- Develop training data pipelines that continuously improve model performance
Required Skills
- Data processing at scale (audio/text preferred)
- Coding skills in Python (systems experience a plus)
- Multilingual data handling and normalization
- Experience with ML/data pipelines
- Understanding of active learning loops and sampling strategies
Required Skills Explained
- Strong fundamentals in data structures, systems, and pipelines
- Experience with large-scale data processing (audio/text preferred)
- Comfortable working with messy, unstructured, real-world data
- Strong Python coding skills are required; systems experience is a plus
- Understanding of ML/data pipelines including training, evaluation, and data curation
Who is this for
If you thrive on working with raw, chaotic data and are passionate about turning it into a competitive advantage, this role is perfect for you. You should enjoy building systems that directly impact model performance.
Why This Job is a Good Opportunity
- Potential to significantly impact model performance by improving data quality
- Challenging yet rewarding role that involves transforming raw data into valuable assets for AI models
- Opportunity to work on cutting-edge, real-time multilingual voice AI systems with global applications
Interview Preparation Tips
- Prepare examples of how you have improved data quality in previous roles
- Demonstrate your understanding of ML/data pipelines and their importance
- Showcase your experience with large-scale data processing, particularly audio/text data
- Discuss any relevant projects or personal initiatives related to data curation and pipeline optimization
Career Growth in This Role
This role offers a pathway to becoming an expert in data engineering for AI systems. With the increasing importance of high-quality data in machine learning, there are numerous opportunities to expand your skill set and contribute to groundbreaking projects.
As you progress, you might move into more strategic roles within data science or even lead teams focused on improving the overall data ecosystem. The continuous demand for skilled professionals who can handle complex data challenges ensures long-term career growth potential.
Frequently Asked Questions
What kind of experience is required? Experience with large-scale data processing, Python coding skills, and an understanding of ML/data pipelines are essential.
Is this role suitable for beginners? No, this position requires a strong foundation in data structures and systems, along with practical experience in data processing.
What benefits can I expect from joining Smallest.ai? Joining Smallest.ai offers the opportunity to work on groundbreaking projects and contribute directly to model performance improvements.