A few weeks ago, after upgrading NeetoCal from
Rails 7.1.5.2 to Rails 7.2.1.1, we started seeing mysterious crashes in
production with the error message SystemStackError: stack level too deep.
The crash happened inside Slots::SyncAllCalendarsService, a service that syncs
multiple calendars concurrently using the Async gem. What made it particularly
puzzling was that switching from Async::Barrier to Thread.new made the error
disappear. This led us down a rabbit hole thinking it was an Async-specific
issue.
The stack trace pointed to a strange location:
activerecord (7.2.1.1) lib/arel/visitors/to_sql.rb:898:in `each'
from lib/arel/visitors/to_sql.rb:898:in `each_with_index'
from lib/arel/visitors/to_sql.rb:898:in `inject_join'
from lib/arel/visitors/to_sql.rb:627:in `visit_Arel_Nodes_Or'
... 1132 levels...
Over a thousand levels deep into visit_Arel_Nodes_Or. That's unusual. Why
would visiting OR nodes in a SQL query cause such deep recursion?
The stack trace mentioned Slots::SyncAllCalendarsService, but that service
itself had nothing obviously wrong. It was just iterating through calendars and
syncing them. We traced deeper into the code and found the actual culprit in
Neeto's Integrations::Icloud::SyncEventsService:
def abandoned_events
pairs = event_params.map { |e| [e[:url], e[:recurrence_id]] }
@_abandoned_events = calendar.events
.where("start_time BETWEEN ? AND ?", start_time, end_time)
.where.not([:url, :recurrence_id] => pairs)
end
This code finds "abandoned" events by excluding events that match specific URL
and recurrence ID combinations. In production, pairs contained 233 entries.
Nothing immediately suspicious, right?
To understand what was happening, we needed to see what Rails was actually doing
with this query. The where.not([:url, :recurrence_id] => pairs) syntax is
Rails' composite key feature. With 233 pairs, Rails generates SQL like this:
WHERE NOT (
(url = 'url1' AND recurrence_id = 'rec1') OR
(url = 'url2' AND recurrence_id = 'rec2') OR
(url = 'url3' AND recurrence_id = 'rec3') OR
-- ... 230 more OR clauses
)
233 OR clauses. That's a lot, but databases can handle it. The problem wasn't the SQL itself - it was how Rails built the internal representation of that query.
To really understand what changed, we used bundle open activerecord to look at
the Rails source code. We compared the same code in Rails 7.1.5.2 and 7.2.1.1.
In Rails 7.1.5.2, the code that visits OR nodes looked like this:
def visit_Arel_Nodes_Or(o, collector)
stack = [o.right, o.left]
while o = stack.pop
if o.is_a?(Arel::Nodes::Or)
stack.push o.right, o.left
else
visit o, collector
collector << " OR " unless stack.empty?
end
end
collector
end
This is an iterative approach using a manual stack. No matter how deeply nested
the OR conditions are, this code only uses a single stack frame for the
visit_Arel_Nodes_Or method. It's stack-safe.
In Rails 7.2.1.1, the same method became much simpler:
def visit_Arel_Nodes_Or(o, collector)
inject_join o.children, collector, " OR "
end
Where inject_join does:
def inject_join(list, collector, join_str)
list.each_with_index do |x, i|
collector << join_str unless i == 0
collector = visit(x, collector) # Recursive call
end
collector
end
This is a recursive approach. If an OR node contains another OR node, it calls
visit, which calls visit_Arel_Nodes_Or, which calls inject_join again. The
recursion depth grows with the nesting depth of the OR tree.
But why is the OR tree so deeply nested? We found the answer in Active Record's
PredicateBuilder#grouping_queries:
def grouping_queries(queries)
if queries.one?
queries.first
else
queries.map! { |query| query.reduce(&:and) }
queries = queries.reduce { |result, query| Arel::Nodes::Or.new([result, query]) }
Arel::Nodes::Grouping.new(queries)
end
end
That reduce call is the key. It builds a left-deep nested tree:
Level 1: Query1 OR Query2 = OR_Node_1
Level 2: OR_Node_1 OR Query3 = OR_Node_2
Level 3: OR_Node_2 OR Query4 = OR_Node_3
...
Level 232: OR_Node_231 OR Query233 = OR_Node_232
With 233 pairs, this creates 232 levels of nesting. Each level adds approximately 5 stack frames when Rails traverses it recursively. That's 1,160 stack frames just for the query building.
Why did Thread.new work but Async::Barrier didn't?
We wrote a test script to measure the actual stack depth limits in different contexts:
def test_recursion_depth(context_name)
depth = 0
recurse = lambda do |n|
depth = n
recurse.call(n + 1)
end
begin
recurse.call(0)
rescue SystemStackError
return depth
end
end
thread_depth = nil
Thread.new do
thread_depth = test_recursion_depth("thread")
end.join
puts "Thread recursion limit: ~#{thread_depth} calls"
async_depth = nil
Async do
async_depth = test_recursion_depth("async")
end
puts "Async fiber recursion limit: ~#{async_depth} calls"
When we ran this, the results were revealing:
Thread recursion limit: ~11910 calls
Async fiber recursion limit: ~1482 calls
Threads have roughly 8x more stack space than Async fibers. This is because threads use larger stacks (1MB-8MB) while fibers are designed to be lightweight with smaller stacks (512KB-1MB). This difference in stack allocation is why threads can handle deeper recursion before hitting the overflow limit.
With 233 pairs requiring ~1,160 stack frames, we were right at the edge of the
Async fiber limit. But we were still well within the thread limit, which is why
switching to Thread.new seemed to fix it.
When we tested with 500 pairs (which some of our larger calendars had), even
Thread.new failed with a stack overflow. So it wasn't really a fix, it just
pushed the problem further down the road.
After we understood the problem, we searched the Rails repository and found the exact history:
April 4, 2024 - PR #51492: This PR changed OR nodes from binary to n-ary to handle multiple children in a single node.
September 24, 2024 - Issue #53031: Someone reported the exact issue we were seeing. Queries with 800+ OR conditions that worked in Rails 7.1 now crashed in Rails 7.2.
September 25, 2024 -
PR #53032:
fatkodima fixed it with a one-line change in
PredicateBuilder#grouping_queries:
Before:
queries = queries.reduce { |result, query| Arel::Nodes::Or.new([result, query]) }
After:
queries = Arel::Nodes::Or.new(queries)
The difference is profound. The old code using reduce created a deeply nested
structure like Russian nesting dolls:
Or([Or([Or([Query1, Query2]), Query3]), Query4]) # 232 levels with 233 pairs
The new code creates a flat structure:
Or([Query1, Query2, Query3, Query4, ...Query233]) # Just 1 level
With a flat structure, the recursive visitor only traverses one level instead of 232, eliminating the stack overflow. The fix is available in Rails 7.2.2+.
We're currently on Rails 7.2.1.1 and upgrading immediately would require extensive testing. So we implemented a workaround in our code:
def abandoned_events
return @_abandoned_events if defined?(@_abandoned_events)
pairs = event_params.map { |e| [e[:url], e[:recurrence_id]] }
return if pairs.empty?
events_in_time_range = calendar.events
.where("start_time BETWEEN ? AND ?", start_time, end_time)
# Get candidate events using simple IN clauses
urls = pairs.map(&:first).uniq
recurrence_ids = pairs.map(&:last).uniq
candidate_events = events_in_time_range
.where(url: urls)
.where(recurrence_id: recurrence_ids)
# Filter to exact pairs in memory
pairs_set = pairs.to_set
ids_to_exclude = candidate_events.select { |event|
pairs_set.include?([event.url, event.recurrence_id])
}.map(&:id)
@_abandoned_events = if ids_to_exclude.empty?
events_in_time_range
else
events_in_time_range.where.not(id: ids_to_exclude)
end
end
This approach:
The trade-off is we run 2 queries instead of 1, but the queries are simpler and more efficient. The in-memory filtering is negligible since we're already constrained by time range.
To prove our hypothesis, we wrote verification scripts that patched Rails internals to measure what was actually happening.
We created a script that patched visit_Arel_Nodes_Or to count how many times
it was called and how deep the recursion went:
module Arel
module Visitors
class ToSql
alias_method :original_visit_Arel_Nodes_Or, :visit_Arel_Nodes_Or
def visit_Arel_Nodes_Or(o, collector)
@or_call_count ||= 0
@or_call_count += 1
@max_depth ||= 0
@current_depth ||= 0
@current_depth += 1
@max_depth = [@max_depth, @current_depth].max
result = original_visit_Arel_Nodes_Or(o, collector)
@current_depth -= 1
result
end
end
end
end
Results with 50 test pairs:
visit_Arel_Nodes_Or was called 1 time and max recursion
depth was 0.visit_Arel_Nodes_Or was called 49 times and max recursion
depth was 100.With 233 pairs in production:
We ran the problematic query with different pair counts to find exactly where it breaks:
# Test with Async
Async do
pairs = calendar.events.limit(count).pluck(:url, :recurrence_id)
calendar.events.where.not([:url, :recurrence_id] => pairs).count
end.wait
# Test with Thread
Thread.new do
pairs = calendar.events.limit(count).pluck(:url, :recurrence_id)
calendar.events.where.not([:url, :recurrence_id] => pairs).count
end.join
Results:
This confirmed that Thread.new wasn't a real solution - it just had more headroom before hitting the same problem.
For now, we're sticking with our workaround. It works reliably, performs well, and doesn't require upgrading Rails immediately. When we do upgrade to Rails 7.2.2+, we can revert to the original clean syntax, knowing that the Rails team has fixed the underlying issue.
If you're on Rails 7.2.0 through 7.2.1.x and use composite key queries with large datasets, watch out for this issue. The fix is in Rails 7.2.2+, or you can work around it like we did.
If this blog was helpful, check out our full blog archive.