A generally good way to avoid this is to randomly shuffle the training data before each epoch. Source: http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/
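
As a rough illustration of per-epoch shuffling, here is a minimal sketch in plain Python. The helper name `shuffled_minibatches`, the toy `examples` list, and the batch size are all made up for this example and do not come from the linked tutorial. The gradient step itself is left as a comment.

```python
import random

def shuffled_minibatches(examples, batch_size, rng):
    """Yield minibatches that cover `examples` exactly once, in a random order."""
    order = list(range(len(examples)))
    rng.shuffle(order)  # draw a fresh random order each time this is called
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
examples = [(float(i), 2.0 * i) for i in range(10)]  # toy (input, target) pairs

epoch_orders = []
for epoch in range(3):
    seen = []
    # calling the generator once per epoch reshuffles the data for that epoch
    for batch in shuffled_minibatches(examples, batch_size=4, rng=rng):
        # an SGD update on `batch` would go here
        seen.extend(batch)
    epoch_orders.append(seen)
```

The shuffle is applied to a list of indices rather than to the data, so each input stays paired with its target and the original list is left untouched.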