nbcapp

New Brown Corpus?

a richly grounded dataset inspired by child language acquisition

simultaneous visual, spatiotemporal, and audio data

more like child-directed speech than other datasets

recorded in VR, which allows semi-realistic motion and interaction

6 kitchen configurations across 2 distinct visual styles

relatively small: 18k words recorded across 108 sessions

unlabeled: no ground truth labels or segmentation

available for download on GitHub

Spatiotemporal data

Spatial parameters are recorded every frame for the head, hands, and each object in the scene.

parameter (type)	description
pos (xyz)	absolute Cartesian position of object center
rot (xyzw)	absolute quaternion rotation of object
vel (xyz)	absolute velocity of object center
relPos (xyz)	position relative to head
relRot (xyzw)	rotation relative to head
relVel (xyz)	velocity from frame of reference of the head
bound (xyz)	distance from object center to edge of bounding box along each axis
inView (bool)	whether object is in the participant's field of view
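
The head-relative parameters can in principle be derived from the absolute ones. The sketch below shows one plausible way to get relPos from pos and the head's pos and rot. It assumes that "relative to head" means the world-space offset rotated into the head's frame of reference, and that rot is a unit quaternion in xyzw order. The dataset may define relPos differently (for example, as a plain offset with no rotation applied). The function names are hypothetical and not part of the dataset or its tooling.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (x, y, z, w), same order as the rot parameter


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def conjugate(q: Quat) -> Quat:
    """Conjugate of a quaternion; for a unit quaternion this is the inverse rotation."""
    x, y, z, w = q
    return (-x, -y, -z, w)


def rotate(q: Quat, v: Vec3) -> Vec3:
    """Rotate vector v by unit quaternion q using v + 2w(u x v) + 2(u x (u x v))."""
    x, y, z, w = q
    u = (x, y, z)
    t = cross(u, v)
    tt = cross(u, t)
    return (
        v[0] + 2.0 * (w * t[0] + tt[0]),
        v[1] + 2.0 * (w * t[1] + tt[1]),
        v[2] + 2.0 * (w * t[2] + tt[2]),
    )


def relative_position(pos: Vec3, head_pos: Vec3, head_rot: Quat) -> Vec3:
    """Assumed definition: world-space offset from the head, expressed in the head's frame."""
    offset = (pos[0] - head_pos[0], pos[1] - head_pos[1], pos[2] - head_pos[2])
    return rotate(conjugate(head_rot), offset)


# Example: head at height 1, turned 90 degrees about the vertical (y) axis;
# an object sits one unit away from the head along the world x axis.
half = math.radians(90.0) / 2.0
head_rot = (0.0, math.sin(half), 0.0, math.cos(half))
rel = relative_position((1.0, 1.0, 0.0), (0.0, 1.0, 0.0), head_rot)
```

With an identity head rotation (0, 0, 0, 1) the result reduces to the plain offset pos minus head pos, which is a quick way to check the assumed definition against the recorded relPos values.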

example of y-position data (height) when picking up an apple: