Arabic Dialect Data for Production AI

We provide curated, licensed and rights-cleared Arabic dialect datasets ready for AI training and evaluation.

We help teams avoid blended-dialect training data by delivering locality-aware Arabic datasets across Levantine, Iraqi, and Yemeni programs.

Inventory changes monthly. Request the latest snapshot in the sample pack.

What we do

Our Arabic dialect datasets span speech, video, and text. Each one is curated, licensed, and rights-cleared so it is ready for AI training.

We solve a common buyer problem: AI models trained on blended-dialect data miss local speech patterns and underperform in production use cases.

Each recording is tagged with dialect (country and region), speaker age and gender, recording environment (indoor/outdoor and noise level), transcript availability, and consent/license status.
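
As a rough illustration, the per-recording fields listed above could be represented as one structured record. This is a minimal sketch only: the class name, field names, and example values are assumptions made for illustration, and the actual schema is the one documented in the sample pack.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RecordingMetadata:
    """Hypothetical per-recording record; field names are illustrative only."""
    clip_id: str
    dialect_country: str
    dialect_region: str
    speaker_age: int
    speaker_gender: str
    environment: str       # "indoor" or "outdoor"
    noise_level: str       # e.g. "low", "medium", "high"
    has_transcript: bool
    consent_status: str
    license_status: str


# Placeholder values, not taken from a real clip.
example = RecordingMetadata(
    clip_id="clip-0001",
    dialect_country="Lebanon",
    dialect_region="Beirut",
    speaker_age=34,
    speaker_gender="female",
    environment="indoor",
    noise_level="low",
    has_transcript=True,
    consent_status="signed",
    license_status="cleared",
)

# Serialize to JSON; ensure_ascii=False keeps any Arabic text readable.
record_json = json.dumps(asdict(example), ensure_ascii=False, indent=2)
print(record_json)
```

Keeping country and region as separate fields lets a buyer filter by locality without parsing a combined label.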

How we work

Dialect identification

Our linguists classify dialects in client-supplied audio/text and return structured labels.

Data collection

Our local teams record conversations in specific dialects and supply metadata about speakers and environments.

Academic licensing

We partner with regional universities to license Arabic text corpora, including ~4 million MSA words from Egypt, and can source additional corpora.

What's in our dataset

40+ hours of conversational Lebanese Arabic data (audio/video), captured through licensed and rights-cleared workflows across Lebanon.
Licensed Egyptian academic text corpus (~4 million words) through university partnerships, with commercial usage rights.
Inventory snapshot (March 2026): 40 audio hours, 40 video hours, 1,800+ clips, 150+ conversations, and ~100 speakers.
Modalities in hand: audio, video, and text.
We can classify your Arabic audio/text by dialect, country, region, and city, returning structured labels and metadata.

What we can collect to spec

We run the full lifecycle from intake to delivery: sourcing, consent, recording, annotation, QA, and packaging. Custom scopes and partner-supplied ingestion are both supported.

Custom collection to spec: dialect, country/region, recording environment, and target volume.
Current custom programs cover Levantine, Iraqi, and Yemeni dialects; Algeria and Morocco are available on request.
We partner with universities to license Arabic texts; our current corpus includes ~4 million MSA words from Egypt.
Partner-supplied workflow: bring licensed data and we handle intake, rights checks, and export packaging.
Buyer-ready exports for training, evaluation, and pilot testing with JSON/CSV manifests and a README.
Milestone payments: 10% at kickoff, 40% at midpoint, 50% after final delivery and QA approval.
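
To show how a CSV manifest from an export might be used, here is a minimal sketch that selects rights-cleared clips for one country. The column names, status values, and rows are hypothetical placeholders, not the real manifest schema, which is described in the schema docs shipped with each export.

```python
import csv
import io

# Hypothetical manifest excerpt; columns and rows are illustrative only.
MANIFEST_CSV = """clip_id,dialect_country,environment,has_transcript,license_status
clip-0001,Lebanon,indoor,true,cleared
clip-0002,Lebanon,outdoor,false,cleared
clip-0003,Lebanon,indoor,true,pending
"""


def select_clips(manifest_text, country, require_transcript=True):
    """Return IDs of clips from `country` that are rights-cleared
    and, if requested, have a transcript."""
    selected = []
    for row in csv.DictReader(io.StringIO(manifest_text)):
        if row["dialect_country"] != country:
            continue
        if row["license_status"] != "cleared":
            continue
        if require_transcript and row["has_transcript"] != "true":
            continue
        selected.append(row["clip_id"])
    return selected


print(select_clips(MANIFEST_CSV, "Lebanon"))
print(select_clips(MANIFEST_CSV, "Lebanon", require_transcript=False))
```

With a real export, the same filter would read the delivered manifest file instead of the inline string.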

Who we serve

LLM labs expanding Arabic pretraining and post-training corpora
Speech teams building ASR, TTS, and voice assistant products
AI startups needing safe, licensed Arabic data for production
Research groups needing clear provenance and reproducible metadata

Licensing overview and sample pack

Sample packs are licensed for evaluation only.

Production use requires an executed production license agreement.

Typical production paths include commercial non-exclusive, research-only, and custom terms.

Sample pack contents:

  • 3-10 representative clips by dialect/environment.
  • Sample JSON/CSV manifest, schema docs, and README.
  • Licensing summary and evaluation terms.

Request a sample pack or tell us your dialect/volume requirements.

We'll respond within two business days.