This paper presents an unsupervised, multimodal, neural network model of early child language acquisition that takes into account the child’s communicative intentions as well as the multimodal nature of language. The model exhibits aspects of one-word child language such as generalisation to new and unforeseen utterances, a U-shaped learning trajectory, and a vocabulary spurt. A probabilistic gating mechanism that predisposes the model to utter single words at the onset of training and two words as training progresses enables the model to exhibit the gradual, continuous transition between the one-word and two-word stages observed in children.
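The probabilistic gate described above might be sketched as follows; the linear annealing schedule and all function names here are illustrative assumptions, not the paper's implementation.

```python
import random


def two_word_probability(step, total_steps):
    """Probability of producing a two-word utterance.

    Assumed to rise linearly over training, so the model is biased
    toward single words early on and two-word utterances later.
    """
    return min(1.0, step / total_steps)


def utter(words, step, total_steps, rng=random):
    """Emit one or two words from a candidate list via a probabilistic gate."""
    if len(words) >= 2 and rng.random() < two_word_probability(step, total_steps):
        return words[:2]  # gate open: two-word utterance
    return words[:1]      # gate closed: single-word utterance
```

With this schedule, early in training (`step` near 0) the gate almost always emits one word, while late in training it almost always emits two, giving a gradual rather than abrupt stage transition.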