This article presents a new method for video representation, called the trajectory-based 3D convolutional descriptor (TCD), which combines the advantages of deep-learned and hand-crafted features. We use deep architectures to learn discriminative convolutional feature maps and apply trajectory-constrained pooling to aggregate these features into effective descriptors. First, valid trajectories are generated by tracking interest points within co-motion superpixels. Second, we use a 3D ConvNet (C3D) to capture both motion and appearance information in the form of convolutional feature maps. Finally, the feature maps are transformed by two normalization methods, channel normalization and spatiotemporal normalization, and trajectory-constrained sampling and pooling aggregate the normalized features into descriptors. The proposed TCD is more discriminative than hand-crafted features and boosts recognition performance. Experimental results on benchmark datasets demonstrate that our pipeline outperforms conventional algorithms in both efficiency and accuracy.
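To make the normalization and pooling steps concrete, the following NumPy sketch illustrates one plausible reading of the pipeline. It is an illustration under stated assumptions, not the authors' implementation: the feature-map tensor is assumed to have shape (C, T, H, W), both normalizations are assumed to divide by a maximum (per channel over the clip, or per position across channels), and the 1/8 scale mapping video coordinates to feature-map coordinates is an assumed network stride; all function names are hypothetical.

```python
import numpy as np

def spatiotemporal_normalize(fmap, eps=1e-8):
    """Normalize each channel by its max over the clip's (T, H, W) extent.

    fmap: convolutional feature maps of shape (C, T, H, W).
    """
    max_per_channel = fmap.max(axis=(1, 2, 3), keepdims=True)
    return fmap / (max_per_channel + eps)

def channel_normalize(fmap, eps=1e-8):
    """Normalize each spatiotemporal position by its max across channels."""
    max_per_position = fmap.max(axis=0, keepdims=True)
    return fmap / (max_per_position + eps)

def trajectory_pool(fmap, trajectory, scale=1.0 / 8.0):
    """Sum-pool feature values sampled along one trajectory.

    trajectory: list of (t, x, y) points in original video coordinates.
    scale: assumed ratio mapping video coordinates to feature-map
           coordinates (depends on the network's spatial stride).
    Returns a C-dimensional descriptor for this trajectory.
    """
    C, T, H, W = fmap.shape
    desc = np.zeros(C, dtype=fmap.dtype)
    for t, x, y in trajectory:
        # map video coordinates onto the feature map and clamp to bounds
        ti = min(max(t, 0), T - 1)
        yi = min(max(int(round(y * scale)), 0), H - 1)
        xi = min(max(int(round(x * scale)), 0), W - 1)
        desc += fmap[:, ti, yi, xi]
    return desc

# usage: pool a toy trajectory over randomly generated feature maps
fmap = np.random.rand(64, 16, 14, 14)            # (C, T, H, W)
traj = [(t, 40 + 2 * t, 60) for t in range(16)]  # (t, x, y) points
tcd = trajectory_pool(channel_normalize(fmap), traj)
print(tcd.shape)                                 # (64,)
```

In this sketch the descriptor dimensionality equals the number of channels C, so applying both normalizations and concatenating the resulting pooled vectors would double the descriptor length; whether the paper concatenates or keeps the variants separate is not specified in this abstract.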