Debugging a Stack Overflow in Rails 7.2.1.1

Vishnu M avatar

Vishnu M

November 25, 2025

Debugging a Stack Overflow in Rails 7.2.1.1

A few weeks ago, after upgrading NeetoCal from Rails 7.1.5.2 to Rails 7.2.1.1, we started seeing mysterious crashes in production with the error message SystemStackError: stack level too deep.

Identifying the problem

The crash happened inside Slots::SyncAllCalendarsService, a service that syncs multiple calendars concurrently using the Async gem. What made it particularly puzzling was that switching from Async::Barrier to Thread.new made the error disappear. This led us down a rabbit hole thinking it was an Async-specific issue.

The stack trace pointed to a strange location:

activerecord (7.2.1.1) lib/arel/visitors/to_sql.rb:898:in `each'
  from lib/arel/visitors/to_sql.rb:898:in `each_with_index'
  from lib/arel/visitors/to_sql.rb:898:in `inject_join'
  from lib/arel/visitors/to_sql.rb:627:in `visit_Arel_Nodes_Or'
  ... 1132 levels...

Over a thousand levels deep into visit_Arel_Nodes_Or. That's unusual. Why would visiting OR nodes in a SQL query cause such deep recursion?

The stack trace mentioned Slots::SyncAllCalendarsService, but that service itself had nothing obviously wrong. It was just iterating through calendars and syncing them. We traced deeper into the code and found the actual culprit in Neeto's Integrations::Icloud::SyncEventsService:

def abandoned_events
  pairs = event_params.map { |e| [e[:url], e[:recurrence_id]] }

  @_abandoned_events = calendar.events
    .where("start_time BETWEEN ? AND ?", start_time, end_time)
    .where.not([:url, :recurrence_id] => pairs)
end

This code finds "abandoned" events by excluding events that match specific URL and recurrence ID combinations. In production, pairs contained 233 entries. Nothing immediately suspicious, right?

To understand what was happening, we needed to see what Rails was actually doing with this query. The where.not([:url, :recurrence_id] => pairs) syntax is Rails' composite key feature. With 233 pairs, Rails generates SQL like this:

WHERE NOT (
  (url = 'url1' AND recurrence_id = 'rec1') OR
  (url = 'url2' AND recurrence_id = 'rec2') OR
  (url = 'url3' AND recurrence_id = 'rec3') OR
  -- ... 230 more OR clauses
)

233 OR clauses. That's a lot, but databases can handle it. The problem wasn't the SQL itself - it was how Rails built the internal representation of that query.

Going into Rails Internals

To really understand what changed, we used bundle open activerecord to look at the Rails source code. We compared the same code in Rails 7.1.5.2 and 7.2.1.1.

What We Found in Rails 7.1.5.2

In Rails 7.1.5.2, the code that visits OR nodes looked like this:

def visit_Arel_Nodes_Or(o, collector)
  stack = [o.right, o.left]

  while o = stack.pop
    if o.is_a?(Arel::Nodes::Or)
      stack.push o.right, o.left
    else
      visit o, collector
      collector << " OR " unless stack.empty?
    end
  end

  collector
end

This is an iterative approach using a manual stack. No matter how deeply nested the OR conditions are, this code only uses a single stack frame for the visit_Arel_Nodes_Or method. It's stack-safe.

What Changed in Rails 7.2.1.1

In Rails 7.2.1.1, the same method became much simpler:

def visit_Arel_Nodes_Or(o, collector)
  inject_join o.children, collector, " OR "
end

Where inject_join does:

def inject_join(list, collector, join_str)
  list.each_with_index do |x, i|
    collector << join_str unless i == 0
    collector = visit(x, collector)  # Recursive call
  end
  collector
end

This is a recursive approach. If an OR node contains another OR node, it calls visit, which calls visit_Arel_Nodes_Or, which calls inject_join again. The recursion depth grows with the nesting depth of the OR tree.

The Tree Structure Problem

But why is the OR tree so deeply nested? We found the answer in Active Record's PredicateBuilder#grouping_queries:

def grouping_queries(queries)
  if queries.one?
    queries.first
  else
    queries.map! { |query| query.reduce(&:and) }
    queries = queries.reduce { |result, query| Arel::Nodes::Or.new([result, query]) }
    Arel::Nodes::Grouping.new(queries)
  end
end

That reduce call is the key. It builds a left-deep nested tree:

Level 1:  Query1 OR Query2 = OR_Node_1
Level 2:  OR_Node_1 OR Query3 = OR_Node_2
Level 3:  OR_Node_2 OR Query4 = OR_Node_3
...
Level 232: OR_Node_231 OR Query233 = OR_Node_232

With 233 pairs, this creates 232 levels of nesting. Each level adds approximately 5 stack frames when Rails traverses it recursively. That's 1,160 stack frames just for the query building.

The Async Connection

Why did Thread.new work but Async::Barrier didn't?

We wrote a test script to measure the actual stack depth limits in different contexts:

def test_recursion_depth(context_name)
  depth = 0

  recurse = lambda do |n|
    depth = n
    recurse.call(n + 1)
  end

  begin
    recurse.call(0)
  rescue SystemStackError
    return depth
  end
end

thread_depth = nil
Thread.new do
  thread_depth = test_recursion_depth("thread")
end.join
puts "Thread recursion limit: ~#{thread_depth} calls"

async_depth = nil
Async do
  async_depth = test_recursion_depth("async")
end
puts "Async fiber recursion limit: ~#{async_depth} calls"

When we ran this, the results were revealing:

Thread recursion limit:      ~11910 calls
Async fiber recursion limit: ~1482 calls

Threads have roughly 8x more stack space than Async fibers. This is because threads use larger stacks (1MB-8MB) while fibers are designed to be lightweight with smaller stacks (512KB-1MB). This difference in stack allocation is why threads can handle deeper recursion before hitting the overflow limit.

With 233 pairs requiring ~1,160 stack frames, we were right at the edge of the Async fiber limit. But we were still well within the thread limit, which is why switching to Thread.new seemed to fix it.

When we tested with 500 pairs (which some of our larger calendars had), even Thread.new failed with a stack overflow. So it wasn't really a fix, it just pushed the problem further down the road.

The GitHub Trail

After we understood the problem, we searched the Rails repository and found the exact history:

  1. April 4, 2024 - PR #51492: This PR changed OR nodes from binary to n-ary to handle multiple children in a single node.

  2. September 24, 2024 - Issue #53031: Someone reported the exact issue we were seeing. Queries with 800+ OR conditions that worked in Rails 7.1 now crashed in Rails 7.2.

  3. September 25, 2024 - PR #53032: fatkodima fixed it with a one-line change in PredicateBuilder#grouping_queries:

Before:

queries = queries.reduce { |result, query| Arel::Nodes::Or.new([result, query]) }

After:

queries = Arel::Nodes::Or.new(queries)

The difference is profound. The old code using reduce created a deeply nested structure like Russian nesting dolls:

Or([Or([Or([Query1, Query2]), Query3]), Query4])  # 232 levels with 233 pairs

The new code creates a flat structure:

Or([Query1, Query2, Query3, Query4, ...Query233])  # Just 1 level

With a flat structure, the recursive visitor only traverses one level instead of 232, eliminating the stack overflow. The fix is available in Rails 7.2.2+.

Our Solution

We're currently on Rails 7.2.1.1 and upgrading immediately would require extensive testing. So we implemented a workaround in our code:

def abandoned_events
  return @_abandoned_events if defined?(@_abandoned_events)

  pairs = event_params.map { |e| [e[:url], e[:recurrence_id]] }
  return if pairs.empty?

  events_in_time_range = calendar.events
    .where("start_time BETWEEN ? AND ?", start_time, end_time)

  # Get candidate events using simple IN clauses
  urls = pairs.map(&:first).uniq
  recurrence_ids = pairs.map(&:last).uniq

  candidate_events = events_in_time_range
    .where(url: urls)
    .where(recurrence_id: recurrence_ids)

  # Filter to exact pairs in memory
  pairs_set = pairs.to_set
  ids_to_exclude = candidate_events.select { |event|
    pairs_set.include?([event.url, event.recurrence_id])
  }.map(&:id)

  @_abandoned_events = if ids_to_exclude.empty?
    events_in_time_range
  else
    events_in_time_range.where.not(id: ids_to_exclude)
  end
end

This approach:

  • Uses simple IN clauses instead of complex OR conditions
  • Filters the exact pairs in memory (fast with a Set)
  • Excludes by ID, which is a flat list
  • Never builds deeply nested Arel nodes

The trade-off is we run 2 queries instead of 1, but the queries are simpler and more efficient. The in-memory filtering is negligible since we're already constrained by time range.

Measuring and Verifying

To prove our hypothesis, we wrote verification scripts that patched Rails internals to measure what was actually happening.

Counting Arel Visitor Calls

We created a script that patched visit_Arel_Nodes_Or to count how many times it was called and how deep the recursion went:

module Arel
  module Visitors
    class ToSql
      alias_method :original_visit_Arel_Nodes_Or, :visit_Arel_Nodes_Or

      def visit_Arel_Nodes_Or(o, collector)
        @or_call_count ||= 0
        @or_call_count += 1
        @max_depth ||= 0
        @current_depth ||= 0
        @current_depth += 1
        @max_depth = [@max_depth, @current_depth].max

        result = original_visit_Arel_Nodes_Or(o, collector)

        @current_depth -= 1
        result
      end
    end
  end
end

Results with 50 test pairs:

  • Rails 7.1.5.2: visit_Arel_Nodes_Or was called 1 time and max recursion depth was 0.
  • Rails 7.2.1.1: visit_Arel_Nodes_Or was called 49 times and max recursion depth was 100.

With 233 pairs in production:

  • Rails 7.1.5.2: Still called 1 time, depth 0 (iterative approach)
  • Rails 7.2.1.1: Called 232 times, depth ~466 (exceeds Async fiber limit)

Testing the Actual Breaking Points

We ran the problematic query with different pair counts to find exactly where it breaks:

# Test with Async
Async do
  pairs = calendar.events.limit(count).pluck(:url, :recurrence_id)
  calendar.events.where.not([:url, :recurrence_id] => pairs).count
end.wait

# Test with Thread
Thread.new do
  pairs = calendar.events.limit(count).pluck(:url, :recurrence_id)
  calendar.events.where.not([:url, :recurrence_id] => pairs).count
end.join

Results:

  • Async Fiber: Breaks at approximately 233 pairs
  • Thread.new: Breaks at approximately 500 pairs

This confirmed that Thread.new wasn't a real solution - it just had more headroom before hitting the same problem.

For now, we're sticking with our workaround. It works reliably, performs well, and doesn't require upgrading Rails immediately. When we do upgrade to Rails 7.2.2+, we can revert to the original clean syntax, knowing that the Rails team has fixed the underlying issue.

If you're on Rails 7.2.0 through 7.2.1.x and use composite key queries with large datasets, watch out for this issue. The fix is in Rails 7.2.2+, or you can work around it like we did.

References

If this blog was helpful, check out our full blog archive.

Stay up to date with our blogs.

Subscribe to receive email notifications for new blog posts.