Sunday, August 23, 2015

Batch huge ActiveRecord collections!

Processing large amounts of data can bring a server to its knees, for example in a background job that processes statistical data. The same applies to data migrations.
Most of the resources are consumed by instantiating ActiveRecord objects. If there is no alternative to instantiating them, Ruby on Rails offers batch processing with ActiveRecord::Batches#find_in_batches.
Take this call, for example:
Order.all.each(&:calculate_summer_sale!)
The example is trivial, but with many orders the load on the production server becomes unpredictable. The entire set of objects is loaded and held in memory until the final pass is completed.
The risk can be limited by batching:
Order.where(invoiced: false).find_in_batches(batch_size: 500) do |orders|
  orders.each(&:calculate_summer_sale!)
end
In this example, 500 orders are loaded into memory and processed per batch.
The :batch_size option can be omitted. The default value is 1000.
Here the orders collection is only iterated with each. In that case ActiveRecord::Batches#find_each is a more concise version (this time with a named scope):
Order.not_invoiced.find_each(batch_size: 500) do |order|
  order.calculate_summer_sale!
end
Both versions work on the same principle. They divide the objects into batches based on the ID, which is why the query methods ActiveRecord::QueryMethods#order and ActiveRecord::QueryMethods#limit cannot be used.
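To make the ID-based batching more tangible, here is a minimal plain-Ruby sketch of the idea. It is not the Rails implementation: Row, TABLE, fetch_after and each_batch are hypothetical stand-ins, and the "table" is just an array. It shows why every batch query has to be ordered by ID and limited to the batch size itself, which leaves no room for a custom order or limit.

```ruby
# Hypothetical stand-in for a model row; only the id matters here.
Row = Struct.new(:id)

# In-memory stand-in for the orders table, with gaps in the IDs.
TABLE = [1, 2, 5, 8, 9, 13, 21].map { |id| Row.new(id) }

# Stand-in for "SELECT ... WHERE id > ? ORDER BY id LIMIT ?".
def fetch_after(last_id, limit)
  TABLE.select { |row| row.id > last_id }.sort_by(&:id).first(limit)
end

# Yields batches of at most batch_size rows, walking the table by ID.
# Assumes positive IDs, so 0 is a safe starting point.
def each_batch(batch_size)
  last_id = 0
  loop do
    batch = fetch_after(last_id, batch_size)
    break if batch.empty?
    yield batch
    break if batch.size < batch_size # a short batch means the table is exhausted
    last_id = batch.last.id          # the next query starts after this ID
  end
end

batches = []
each_batch(3) { |batch| batches << batch.map(&:id) }
p batches
```

Gaps in the IDs do not matter, because each batch simply continues after the last ID seen.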
Another option is :start. It allows the batches to be split across multiple processes.
For example, the first process handles the first 20 batches (each with 1000 orders):
Order.where(id: 1..20_000).find_each do |order|
  order.calculate_summer_sale!
end
and the second process handles the rest (:start is inclusive, so it begins at 20_001 to avoid processing order 20_000 twice):
Order.find_each(start: 20_001) do |order|
  order.calculate_summer_sale!
end
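With more than two processes, working out the boundaries by hand gets error-prone. The following is a small sketch of how non-overlapping ID ranges could be computed. The helper id_ranges is hypothetical and not part of Rails, and the lowest and highest ID are assumed to be known up front.

```ruby
# Splits the inclusive ID range min_id..max_id into at most `workers`
# contiguous, non-overlapping ranges of roughly equal width.
def id_ranges(min_id, max_id, workers)
  width = ((max_id - min_id + 1) / workers.to_f).ceil
  ranges = (0...workers).map do |i|
    first = min_id + i * width
    last  = [first + width - 1, max_id].min
    first..last
  end
  # Drop empty ranges, which occur when there are more workers than IDs.
  ranges.reject { |range| range.begin > range.end }
end

ranges = id_ranges(1, 40_000, 2)
p ranges

# Each process would then take one range, for example (not run here):
#   Order.where(id: range).find_each { |order| order.calculate_summer_sale! }
```

The ranges are split by ID, not by record count, so the processes only get an equal share of the work if the IDs are distributed fairly evenly.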
ActiveRecord::QueryMethods#select should be used with caution. Take this example:
Order.select('total').find_each(batch_size: 500) do |order|
  order.calculate_summer_sale!
end
Smaller objects are instantiated, which reduces the load. But at most the first batch of 500 objects is processed, because the batching is based on the primary key, which was not selected. Without it, Rails cannot determine where the next batch starts and aborts with an error. At least the ID has to be selected as well:
Order.select('id, total').find_each(batch_size: 500) do |order|
  order.calculate_summer_sale!
end
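Why the ID is needed can be illustrated with a small plain-Ruby simulation. This is only a sketch under assumptions: ORDERS, select_after and each_batch are hypothetical stand-ins for the table and the batching logic, and the error raised here merely imitates the kind of guard Rails applies.

```ruby
# Hypothetical table rows; a "query" returns only the selected columns.
ORDERS = (1..5).map { |id| { id: id, total: id * 10 } }

# Stand-in for a SELECT of `columns`, ordered by ID, after last_id.
def select_after(columns, last_id, limit)
  ORDERS.select { |row| row[:id] > last_id }
        .first(limit)
        .map { |row| row.select { |key, _| columns.include?(key) } }
end

# Walks the rows in batches; needs :id in each row to find the next batch.
def each_batch(columns, batch_size)
  last_id = 0
  loop do
    batch = select_after(columns, last_id, batch_size)
    break if batch.empty?
    last_id = batch.last[:id]
    raise ArgumentError, "primary key not selected" if last_id.nil?
    yield batch
    break if batch.size < batch_size
  end
end

# With the ID selected, every row is visited.
totals = []
each_batch([:id, :total], 2) { |batch| totals.concat(batch.map { |row| row[:total] }) }

# Without the ID, the offset for the next batch is unknown.
error = begin
  each_batch([:total], 2) { |batch| batch }
  nil
rescue ArgumentError => e
  e.message
end

p totals
p error
```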

Supported by Ruby 2.2.1, Ruby on Rails 4.2.1