Out-of-core tensor decomposition, which is widely used to compress large-scale multidimensional datasets, is often very time-consuming. This article presents a method based on two key ideas to improve its performance. First, cache-aware static scheduling schemes are employed to reduce the total number of disk accesses. Second, we exploit the massively parallel computing power and large memory capacity of modern GPUs to accelerate the linear algebra operations in tensor decomposition. Our experiments demonstrate that the proposed method achieves speedups of 11~16 over a naive implementation and 2.5~5.3 over previous work [43] for practical data-driven rendering applications.
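To make the two ideas concrete, the following minimal sketch (not the paper's implementation) illustrates how a single factor matrix of a Tucker-style decomposition might be computed out of core with GPU linear algebra: the mode-0 Gram matrix of a tensor stored on disk is accumulated block by block on the GPU, so each block is read from disk only once, and the factor is then obtained from a GPU eigendecomposition. The file layout, block size, rank, and CuPy usage below are illustrative assumptions, not the scheduling scheme proposed in the paper.

```python
# Sketch only: one out-of-core HOSVD step under assumed data layout and ranks.
import numpy as np
import cupy as cp  # GPU arrays and linear algebra (assumed backend)

def mode0_factor(path, shape, block_cols, rank, dtype=np.float32):
    """Return the rank-`rank` mode-0 factor of a tensor stored row-major at `path`."""
    I0 = shape[0]
    ncols = int(np.prod(shape[1:]))            # columns of the mode-0 unfolding
    X = np.memmap(path, dtype=dtype, mode="r", shape=(I0, ncols))
    G = cp.zeros((I0, I0), dtype=cp.float32)   # Gram matrix accumulated on the GPU
    for c in range(0, ncols, block_cols):
        blk = cp.asarray(X[:, c:c + block_cols])   # one disk read per block
        G += blk @ blk.T                           # GPU matrix multiply
    w, U = cp.linalg.eigh(G)                       # GPU eigendecomposition
    return U[:, ::-1][:, :rank]                    # leading eigenvectors form the factor

# Hypothetical usage: a 256 x 256 x 256 float32 tensor, 4096-column blocks, rank 16.
# U0 = mode0_factor("tensor.bin", (256, 256, 256), 4096, 16)
```

Reading each block exactly once per mode is what a cache-aware schedule aims to guarantee; the actual scheduling and GPU kernels in the proposed method are more elaborate than this single-pass example.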