Home Processing pgSQL query results in batches
Reply: 1

Processing pgSQL query results in batches

Anna Kizilova
1#
Anna Kizilova Published in 2017-11-12 22:15:05Z

I've written the rake task to perform a postgreSQL query. The task returns an object of class Result.

Here's my task:

task export_products: :environment do
  results = execute "SELECT smth IN somewhere"
    if results.present?
      results
    else
      nil
    end
end

def execute sql
  ActiveRecord::Base.connection.execute sql
end

My further plan is to split the output in batches and save these batches one by one into a .csv file. Here I get stuck. I cannot imagine how to call find_in_batches method of ActiveRecord::Batches module for PG::Result.

How should I proceed?

Edit: I have a legacy sql query to a legacy database

mu is too short
2#
mu is too short Reply to 2017-11-13 01:42:59Z

If you look at how find_in_batches is implemented, you'll see that the algorithm is essentially:

  1. Force the query to be ordered by the primary key.
  2. Add a LIMIT clause to the query to match the batch size.
  3. Execute the modified query from (2) to get a batch.
  4. Do whatever needs to be done with the batch.
  5. If the batch is smaller than the batch size, then the unlimited query has been exhausted so we're done.
  6. Get the maximum primary query value (last_max) from the batch you get in (3).
  7. Add primary_key_column > last_max to the query from (2)'s WHERE clause, run the query again, and go to step (4).

Pretty straight forward and could be implemented with something like this:

def in_batches_of(batch_size)
  last_max = 0 # This should be safe for any normal integer primary key.
  query = %Q{
    select whatever
    from table
    where what_you_have_now
      and primary_key_column > %{last_max}
    order by primary_key_column
    limit #{batch_size}
  }

  results = execute(query % { last_max: last_max }).to_a
  while(results.any?)
    yield results
    break if(results.length < batch_size)
    last_max = results.last['primary_key_column']
    results = execute(query % { last_max: last_max }).to_a
  end
end

in_batches_of(1000) do |batch|
  # Do whatever needs to be done with the `batch` array here
end

Where, of course, primary_key_column and friends have been replaced with real values.

If you don't have a primary key in your query then you can use some other column that sorts nicely and is unique enough for your needs. You could also use an OFFSET clause instead of the primary key but that can get expensive with large result sets.

You need to login account before you can post.

About| Privacy statement| Terms of Service| Advertising| Contact us| Help| Sitemap|
Processed in 0.308396 second(s) , Gzip On .

© 2016 Powered by mzan.com design MATCHINFO