Our experiments might support this.
With 4 or 8 Titan X GPUs, training time drops from more than a week to one or two days.
Code is available at https://github.com/yjxiong/caffe
Data parallelism is easy to implement and gives roughly linear training-time speedups, e.g. from 1 week down to about 2 days with 4x the hardware. But it does not get you bigger models.
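To make that concrete, here is a toy numpy sketch of what one synchronous data-parallel step boils down to (purely illustrative; the function names and shapes are made up, and this is not how yjxiong/caffe or any real framework implements it): each replica works on its own slice of the batch, and only the averaged gradients are shared.

```python
import numpy as np

def loss_grad(weights, x, y):
    # Gradient of a least-squares loss for a linear model: d/dw ||x w - y||^2
    return 2.0 * x.T @ (x @ weights - y) / len(x)

def data_parallel_step(weights, x_batch, y_batch, n_workers=4, lr=0.01):
    # Scatter: split the global batch across the workers.
    x_shards = np.array_split(x_batch, n_workers)
    y_shards = np.array_split(y_batch, n_workers)
    # Every replica holds the same weights and computes gradients on its own
    # shard; on real hardware these run in parallel, so each step costs
    # roughly 1/n_workers of the single-GPU compute.
    grads = [loss_grad(weights, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # All-reduce: average the gradients and apply one shared update.
    return weights - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
w = np.zeros(8)
x, y = rng.normal(size=(64, 8)), rng.normal(size=64)
w = data_parallel_step(w, x, y)
```

Because every replica still holds a full copy of the weights, throughput scales with the number of GPUs but the maximum model size does not.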
Model parallelism is what gets you bigger models, and it is what I've been referring to. It is overkill and doesn't work well in practice anyway. Frameworks shouldn't be expected to support it when there are more interesting problems in deep learning they could spend that effort on instead.
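For contrast, a toy sketch of model parallelism in the same spirit (again purely illustrative, not any framework's actual implementation): the layers are split across devices so each one only has to hold part of the weights, at the cost of shipping activations across the device boundary every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Device 0" holds the first layer's weights, "device 1" the second layer's,
# so neither device has to fit the whole model in memory.
w0 = rng.normal(size=(256, 1024))   # lives on device 0
w1 = rng.normal(size=(1024, 10))    # lives on device 1

def forward(x):
    h = np.maximum(x @ w0, 0.0)     # computed on device 0
    h = h.copy()                    # activation transfer: device 0 -> device 1
    return h @ w1                   # computed on device 1 (device 0 now idles)

x = rng.normal(size=(32, 256))
logits = forward(x)                 # shape (32, 10)
```

Without pipelining, each device sits idle while the other works, and backprop has to ship gradients back the other way, which is a big part of why it is hard to make this pay off.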
Better research and methods will come along eventually, and the calculus will change then, but not yet! Today it is the very definition of premature optimization in nearly all cases.