high performance priority queues & ruby

Recently, I explored utilizing a priority queue for a project I am working on in ruby that involves reading in more data that can fit in RAM by multiples, and also writing out that data in realtime. For a small portion of this external sorting mechanism, I wanted to use a priority queue. However, ruby does not have a priority queue on tap to use. It does utilize gems as sometimes standard integrations, or otherwise popular solutions for install. One of the most commonly utilized priority queues in ruby was written in the 2008 Google Summer of Code project. In fact, it is so commonly used and heavily integrated with the development community, it is the default repo for:

$ gem install algorithms

Surprisingly, the algorithms priority queue was prohibitvely slow for my needs. I thought I might be inadvertently causing this degradation in my implementation of this queue with my larger project, but isolated benchmark testing with, real, cpu, and system time with garbage collection purges between each run, as well as ips, confirmed the slow performance. For a sanity check, it turns out others were having issues with the priority queue in this gem as well.

Next, I went to the code to look at the data structures and algorithms used to see if there were any obvious disadvantages for runtimes. Simply looking at the code showed the priority queue container utilized a heap. Using a heap is a common, and agreed to be a generally optimized implementation. A binary heap is a heap with up to two children. A heap is a specialized data structure that satisfies the heap property. For a max heap, the value of the parent node is either greater than or equal to the value of the child nodes.

heap

Note that a heap is not a binary search tree (BST). A binary search tree has a more specific ordering property, where all left descendents must be less than or equal to all right descendents.

We can summarize some of the most commonly known implementations with the following analysis, where M is the max element, and N is the number of elements in the queue.

implementation time space insert del max max
naive unordered array M N M 1 N N
naive ordered array M N M N 1 1
binary heap N log M M log N log N log N

queue the code

To understand what could be causing the issue despite the priority queue using a heap, I decided to make my own optimized priority queue with a heap in ruby, so I could compare my performance to the standard gem.

There are multiple types of data structures that can be used to implement the abstract tree structures of a heap, but I used an array to implement a binary heap, where the children nodes of parent node at element index k, are located at child indices 2k and 2k + 1. To make the math for referencing parent and child indices from eachother correct across the array, the first indice 0 is nil. The values from the heap example above are located in the array elements according to the indice values, for reference.

array

getting the class together
class PriorityQueue
  attr_reader :elements, :size

  def initialize
    @elements = [nil]
    @size = nil
  end

  def size
    @size = (@elements.size - 1)
  end

  def isEmpty? 
    return @size == 0
  end

  # compares <= 1 + logN
  def <<(element)
    @elements << key
    size
    swim(size)
  end
sink

While iterating through the elements of the array, we have to find the largest child node, as the heap ordering does not specify which child node should be larger than the other, just that both <= the parent. If the parent has children, and is less than than a child node, the parent will exchange places, or indices in the array, with the child node, promoting the child node to match is priority with a higher ordering in the heap array, while the parent node gets demoted to be a lesser child node. We then set the indice, now again analyze the parent, newly demoted to a child, as a new parent to its current children, to see if it should be demoted.

  private
  def sink(k)
    # j = child
    while 2*k <= size
      j = 2*k
      # get the largest child
      if j < size && less?(j, j+1)
        j += 1
      end
      #if parent not < largest child
      if !less?(k,j)
        break
      end
      exch(k, j)
      k = j
     end
  end
swim

If the newly inserted element, now a child, is anything other than the first element, or the root at index 1, and the parent is less, the parent will swap places, or indices in the array, with the child. The new indice of the promoted child/inserted element, now parent to its previous parent, will be considered again if it is not the root of the entire heap, to see if it is larger than its new set of parent.


  private
  def swim(k)
    # while not root and k's parent < k
    while k > 1 && less?(k/2, k) 
        exch(k, k/2)
        k = k/2
    end
  end
pop

After a highest priority element is popped off the top by being recorded and later returned as max, the value exchanges array indexes in the heap array with the lowest priority element. The size of the array is downsized to delete the highest priority element which is no longer apart of the heap. The previous lowest priority element, is now at the top top of the heap, and needs to be reordered or “sink” to its appropriate priority.


  # compares <= 2LogN
  def pop
    max = @elements[1]
    exch(1, size)
    @elements.pop
    size
    sink(1)
    return max
  end

array helpers
  private
  def less?(i, j)
    return (@elements[i] < @elements[j])
  end

  private
  def exch(i, j)
    @elements[i], @elements[j] = @elements[j], @elements[i]
  end
a mini example
q = PriorityQueue.new
q << 2
q << 5
q << 7
p q.elements # => [nil, 7, 2, 5]
# q << ...

make no assumptions

Now it was time for some microbenchmarking, utilizing a newly made priority queue. This should lead to some pretty straight forward comparisons also using a heap, in the same language as the troublesome gem.

I used benchmark.bmbm to calibrate for memory allocation and garbage collections in the subsequent runs.

arr = [1E3, 1E4, 1E5]
arr.each do |n|
   Benchmark.bmbm do |x|
     # google summer of code/ standard gem heap implementation
     fib_heap = Containers::PriorityQueue.new
     my_pq = New_Max_Pq.new

     # push
     x.report("heap: Size: #{n}") { for i in 1..n; fib_heap.push(i,i); end }
     x.report("my pq: Size #{n}") { for i in 1..n; post << i; end }

     # pop
     x.report("heap: Size: #{n}") { for i in 1..n; fib_heap.pop end }
     x.report("my pq: Size #{n}") { for i in 1..n; my_pq.pop ; end }
  end
end

While the original benchmark was set up to test up to 400 million inserted objects and pop them off, I didn’t need to wait up all night to see the results.

screenshots or it didn’t happen

results

As it turns out, my code was faster, much faster.

I should have been excited, but mostly I was confused by the large discrepency. To create a more detailed round of performance testing, I decided to start by comparing code for the insertions. Both queues used heaps, but I had not looked into the separate container for the heap instantiated by the priority queue class in the default ruby algorithms gem.

Outside of some obvious increase in object class instantiations for nodes, I realized insertion used a merge function, and linked lists. Linked Lists! Merge functions! What is going on here? It turns out the ruby algorithms gem doesn’t use a normal heap, it uses a fibonacci heap. My first reaction was, what is a fibonacci heap?

reverse engineering amortised runtimes

It seems that a fibonacci heap is an exercise for optimizing in amortised analysis. In fact, both the formal documentation, wikipedia, and all the sources I could find stated that it was extremely slow performance wise, and generally accepted to be non ideal for realtime applications, due to large memory overhead, and the realization of worst cases in realtime. I looked at rubyworks, and it also uses a fibonacci heap. Instead of going into how the concept of potential is used to justify the amortised runtime, I wanted to know more about about what is happening in real time.

data structure

A fibonacci heap is a doubly linked list that contains the root nodes of a collection of heap ordered trees, where no two heaps have the same order. Each node can contain up to four pointers. A pointer to the minimum node of the root list is always kept up to date.

fib

object proliferation

This design indicates alot more objects, utilizing a node class for every element, with up to four pointers. If you are familiar with data structures and runtimes, you should not have to confirm that there should be a high overhead for memory by looking at the code, even without knowing about how the methods work and interact with eachother.

Unlike C, ruby by default has it’s own heap, and performs memory allocation and garbage collection (GC). This is not the same heap as mentioned in the data structures used for the queue above. Programs have a stack used for static memory allocation, and a heap that is used for dynamic memory allocation. Variables allocated on the heap have their memory allocated at run time. Accessing this is slower than accessing stack variables. The heap size is only limited by the size of virtual memory.

The ruby heap is made up of slots. Each slot can hold one object. An object in Ruby is a struct called RVALUE. Each RVALUE is 40bytes in size, which is also the size of a slot. Ruby initializes a heap with preallocated memory. By default the heap size is set to 16MB for Ruby 2.4. In earlier versions, this limit is 8MB. Additionally, the Ruby heap organizes the heap into pages, where each page can hold 408 slots. GC.stat[:heap__sorted_length] returns how many pages Ruby has allocated. Because Ruby initially allocates this memory, it does not reflect how many objects have been allocated into memory. GC.stat[:heap_used] returns how many pages are currently in use, versus simply allocated. This number can also contain live objects as well as free slots. GC.stat[:heap_eden_page_length] returns live objects. GC.stat[:heap_tomb_page_length] returns slots with no live objects that will be used when Eden runs out of space.

Given that most formal documentation on the fibonacci heap evades observation of realtime performance, and ruby garbage collection is rapidly evolving from one release to the next, it is probably best to confirm this a significant contributer to the disparity in performance.

To see how many objects, and thus easily calculate how much memory has been allocated, we can run a test to see how many pages are needed to scale up the priority queue.

allocated_before = GC.stat(:heap_used)
puts "#{allocated_before}"

arr = [1E3, 1E4, 1E5, 1E6, 1E7, 1E8]
arr.each do |n|
  for i in 1..n
    # test myqueue
    # in a separate run test fibonacci heap
  end
end

allocated_after = GC.stat(:heap_used)
puts "#{allocated_after}"

With five runs for each version of the queue, the fibonacci heap consistently requires 2x as many objects as the array based heap I created.

memory allocation

As mentioned above, Ruby initializes the program with a heap size of 16MB. This means ruby does not trigger a garbage collection run to see if it can first get rid of stuff it doesn’t need anymore, which takes a relatively long time, or go ask the kernel for more memory, which also takes a relatively long time, until the ruby programs needs more than 16MB. Ruby GC waits until all the slots in the preallocated heap are full, before it begins to see which objects it does not need anymore, and frees them. If the slots are still full, and in this case they will be, as all loaded elements will still be in the array, then ruby will allocate more memory.

Ruby GC has internal variables that can be tuned from their default values. RUBY_GC_HEAP_INIT_SLOTS allocates the initial number of slots on the Ruby heap. The default value is 1000. When Ruby does need to allocate more memory, it gets more than it originally had by a factor of the current amount of memory RUBY_GC_HEAP_GROWTH_FACTOR, and utilizes this factor every time it scales up its memory allocation. Currently, this value is set as a default to 1.8x. If RUBY_GC_HEAP_GROWTH_FACTOR is set to allocate more memory than is needed, it will result in unecessary latency, allocating much more memory than is needed, slowing down applications that will not use all of the next allocation of memory.

Amongst the community developers for Ruby GC, Sam Saffron had originally brought up this issue here to decrease the default growth factor, but for now it is left as a tunable parameter. Initially, tuning it to 1.3 is a good first edit to compare performance for applications that do not have exponential object proliferation or otherwise growth expected in the application.

In this case, the priority queue I made for testing above has insertions loading onto the element queue with 8 byte integers. This means I can load about 2 million integers into my element based priority queue, give or take some for the overhead of the class, before ruby triggers more memory allocation.

This could partially explain why between the second and third run, the fibonacci heap runtime increased 123x, with a signifcantly higher ratio in system time, as it had to sweep the heap for recycleable memory, and allocate 1.8x the current amount of memory multiple times. This should be tuned to fit the number of live objects after a commonly booted process or Rails application is fully booted to avoid GC runs for initialization.

You can increase RUBY_GC_HEAP_INIT_SLOTS to tailor this limit to the size of your program or Rails application. Increasing RUBY_GC_HEAP_INIT_SLOTS for both queues in this case. would scale down the performance discrepency between the two, but not eliminate it.

delayed heap ordering

This is because the fibonacci heap has delayed heap ordering after insertion, which is deferred to the extract minimum function. (The standard fibonacci heap defaults to a min heap, but can be utilized with a max heap.) The extract minimum function removes the minimum. If the minimum had any children, it adds the children to the root list each as their own heap, with their subheaps, if they have any.

Then, it invokes a merge operation. The merge operation merges all of the heaps of the same order. After this, the minimum is then set to the new minimum in the list. The fact that the most expensive methods are not apart of the insertion, further confirms the significance of memory overhead contributing to the repeated insertions with no intermittent extractions, from the benchmark.bmbm test above.

the worst case is a likely case

While the fibonacci heap advertises O(log(n)) amortised running times for the extract minimum and delete methods, they have a worst case of O(n) due to the deferred heap ordering. This is because the heap ordering is not limited to O(log(n)) convenience from having a balanced tree.

This is not a degradation in performance from an unlikely arrangement of input values. The heap ordering will happen intermittently. This results in some operations running very fast, and some operations running very slow. More detailed explanations of the standard methods can be found here

lessons learned

code

Read the code first, documentation second

know your GC and ObjectSpace

Ruby has had many changes with garbage collection algorithms, and other auxiliary implementations since the 1.9 MRI release. GC.stat metrics have always been undefined in the formal documentation of all releases.

Understanding how GC impacts GC.stat outputs leads to understanding the results of the metrics, instead of relying on their potentially inconsistent and unintuitive output. There are alot of memory profile gems released that were very helpful, that are not compatible with Ruby 2.4 or above. Looking through the garbage collector will be helpful in creating a memory_profiler gem going forward.