
bug: async sales order grid sync permanently skips orders due to watermark race condition #40803

@damienwebdev

Description

Preconditions and environment

Steps to reproduce

  1. Enable async grid indexing: bin/magento config:set dev/grid/async_indexing 1
  2. Ensure cron is running (* * * * * schedule for sales_grid_order_async_insert)
  3. Create several test orders so the cron has work to do each cycle
  4. While the cron is mid-execution (processing other orders), manually complete an order in the admin panel (e.g., create invoice + shipment, or change status via the comment form). To hit this window reliably, you will need to artificially slow down the cron's query.
  5. Wait for multiple cron cycles to pass (5+ minutes)
  6. Check the Sales > Orders grid in admin

Expected result

The order grid should reflect the updated order status (e.g., "Complete") within one or two cron cycles (1-2 minutes).

Actual result

The order grid permanently shows the old status (e.g., "Processing") for the affected order. The stale data persists indefinitely and never self-corrects.

Additional information

Root cause

UpdatedAtListProvider (vendor/magento/module-sales/Model/ResourceModel/Provider/UpdatedAtListProvider.php) uses a LastUpdateTimeCache watermark to optimize its query. The intended query is:

SELECT main.entity_id FROM sales_order main
INNER JOIN sales_order_grid grid
  ON main.entity_id = grid.entity_id
  AND main.updated_at > grid.updated_at

But the watermark adds an additional filter:

WHERE main.updated_at > :cached_watermark_timestamp
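
The effect of the extra filter can be reproduced in isolation. The following is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for MySQL. The table and column names follow the queries above; the sample rows, the dates, and the `find_stale_ids` helper are hypothetical and are not Magento code.

```python
import sqlite3


def find_stale_ids(conn, watermark=None):
    """Return order IDs whose grid row is older than the order row.

    When `watermark` is given, the additional watermark filter is applied
    on top of the join condition (hypothetical helper for illustration).
    """
    sql = (
        "SELECT main.entity_id FROM sales_order main "
        "INNER JOIN sales_order_grid grid "
        "  ON main.entity_id = grid.entity_id "
        "  AND main.updated_at > grid.updated_at"
    )
    params = ()
    if watermark is not None:
        sql += " WHERE main.updated_at > ?"
        params = (watermark,)
    sql += " ORDER BY main.entity_id"
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_order (entity_id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.execute("CREATE TABLE sales_order_grid (entity_id INTEGER PRIMARY KEY, updated_at TEXT)")

# Order 1 is in sync. Order 2 was updated after its grid row was written,
# but its updated_at is older than the watermark used below.
conn.executemany(
    "INSERT INTO sales_order VALUES (?, ?)",
    [(1, "2025-01-01 14:30:05"), (2, "2025-01-01 14:30:01")],
)
conn.executemany(
    "INSERT INTO sales_order_grid VALUES (?, ?)",
    [(1, "2025-01-01 14:30:05"), (2, "2025-01-01 14:29:00")],
)

stale_without_watermark = find_stale_ids(conn)
stale_with_watermark = find_stale_ids(conn, watermark="2025-01-01 14:30:05")

print(stale_without_watermark)  # [2]
print(stale_with_watermark)     # []
```

The join condition alone finds the out-of-sync order; once the watermark sits past that order's `updated_at`, the same order is no longer returned.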

The race condition:

  1. Cron starts and queries for unsynced order IDs
  2. During query execution, an admin completes Order X — sales_order.updated_at is set to T1
  3. The cron's query (already in flight) does not see Order X
  4. Cron finishes processing other orders whose updated_at values are > T1 and advances the watermark past T1
  5. Every subsequent cron run filters with WHERE main.updated_at > :watermark — Order X (at T1) fails this check and is skipped
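
The five steps above can be replayed with a small in-memory model. This is a sketch under stated assumptions, not the Magento implementation: timestamps are plain integers, `Store`, `query_stale_ids`, and `sync` are hypothetical names, and the way another order ends up with an `updated_at` later than T1 (it changes again while the cron is still copying rows) is one possible interleaving chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    orders: dict = field(default_factory=dict)  # entity_id -> updated_at (seconds)
    grid: dict = field(default_factory=dict)    # entity_id -> updated_at at last sync
    watermark: int = 0


def query_stale_ids(store, use_watermark=True):
    """IDs present in the grid whose order row is newer than the grid row."""
    ids = []
    for oid, ts in store.orders.items():
        out_of_sync = oid in store.grid and ts > store.grid[oid]
        if out_of_sync and (not use_watermark or ts > store.watermark):
            ids.append(oid)
    return ids


def sync(store, ids):
    """Copy current order rows into the grid, then advance the watermark."""
    copied = []
    for oid in ids:
        store.grid[oid] = store.orders[oid]
        copied.append(store.orders[oid])
    if copied:
        store.watermark = max(copied)


ORDER_X = 3
store = Store(orders={1: 8, 2: 9, ORDER_X: 2}, grid={1: 1, 2: 1, ORDER_X: 2})

in_flight = query_stale_ids(store)  # step 1: finds orders 1 and 2 only
store.orders[ORDER_X] = 11          # step 2: admin completes Order X at T1 = 11
store.orders[1] = 12                # another order changes after T1
sync(store, in_flight)              # steps 3-4: Order X is not in the list
later_runs = [query_stale_ids(store) for _ in range(3)]  # step 5
still_stale = query_stale_ids(store, use_watermark=False)

print(store.watermark)  # 12
print(later_runs)       # [[], [], []]
print(still_stale)      # [3]
```

In this model Order X stays out of sync in every later run, while the same check without the watermark filter still reports it.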

Why it never self-heals:

The watermark is stored in Magento's cache with a 3600-second TTL (LastUpdateTimeCache.php:44-49). However, every cron cycle that processes at least one order calls lastUpdateTimeCache->save() (Grid.php:150), which resets the TTL. On any active store, the cron processes orders every minute, so the cache is perpetually refreshed and never expires. The skipped order is permanently stuck.
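
The TTL behaviour described above can be illustrated with a toy cache on a simulated clock. This is a sketch only: `TtlCache` and `minutes_until_expiry` are hypothetical names, and the single assumption carried over from the text is that each save restarts a 3600-second TTL.

```python
class TtlCache:
    """Toy single-value cache whose TTL restarts on every save."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.value = None
        self.expires_at = None

    def save(self, value, now):
        self.value = value
        self.expires_at = now + self.ttl  # each save pushes the expiry forward

    def get(self, now):
        if self.expires_at is None or now >= self.expires_at:
            return None
        return self.value


def minutes_until_expiry(save_every_minute, horizon_minutes=24 * 60):
    """Return the minute at which the entry expires, or None if it never does."""
    cache = TtlCache(ttl=3600)
    cache.save("watermark", now=0)
    for minute in range(1, horizon_minutes + 1):
        now = minute * 60
        if cache.get(now) is None:
            return minute
        if save_every_minute:
            cache.save("watermark", now)  # cron processed at least one order
    return None


busy_store = minutes_until_expiry(save_every_minute=True)
idle_store = minutes_until_expiry(save_every_minute=False)

print(busy_store)  # None
print(idle_store)  # 60
```

With a save every minute the entry never expires within the simulated day; only a store that processes no orders for a full hour would see the watermark lapse.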

UpdatedIdListProvider cannot catch these orders either — it only finds orders completely missing from the grid (entity_id IS NULL), not orders with stale data.
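
To illustrate why a missing-row check cannot rescue a stale row, here is a minimal sketch using an in-memory SQLite database. The `LEFT JOIN ... IS NULL` shape is an assumption inferred from the `entity_id IS NULL` condition mentioned above, and the actual query in UpdatedIdListProvider may differ; the sample rows and dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_order (entity_id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.execute("CREATE TABLE sales_order_grid (entity_id INTEGER PRIMARY KEY, updated_at TEXT)")

conn.executemany(
    "INSERT INTO sales_order VALUES (?, ?)",
    [
        (10, "2025-01-01 14:30:01"),  # stale: grid row exists but is older
        (11, "2025-01-01 14:30:03"),  # missing: no grid row at all
    ],
)
conn.execute("INSERT INTO sales_order_grid VALUES (10, '2025-01-01 14:00:00')")

# Assumed shape of a "missing from grid" lookup: only rows with no grid
# counterpart match, so a stale-but-present row is never returned.
missing_from_grid = [
    row[0]
    for row in conn.execute(
        "SELECT main.entity_id FROM sales_order main "
        "LEFT JOIN sales_order_grid grid ON main.entity_id = grid.entity_id "
        "WHERE grid.entity_id IS NULL"
    )
]

print(missing_from_grid)  # [11]
```

Order 10 has a grid row, so it never matches the `IS NULL` condition even though its grid data is out of date.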

Note on PR #40271: This PR (merged Dec 2025) moves the watermark from transient cache to persistent database storage via FlagManager. While it solves unnecessary full-table scans after cache flushes, it does not fix this race condition — and makes the permanent nature of the bug explicit rather than accidental (no TTL at all in the DB-backed version).

Note on commit 85baae4: This commit refactors UpdatedIdListProvider with cursor-based scanning. It improves performance for new orders missing from the grid but does not touch UpdatedAtListProvider where the race condition lives.

Timeline for understanding

Timeline:
─────────────────────────────────────────────────────────

14:30:00  Cron starts running. It queries for stale orders.

14:30:01  While the cron's query is executing, an admin
          completes Order #5432. sales_order is updated
          with updated_at = 14:30:01.

          But the cron's query already started — it doesn't
          see Order #5432.

14:30:06  Cron finishes processing OTHER orders it found.
          The highest updated_at among those was 14:30:05
          (from a different order). It saves the watermark
          as 14:30:05.

14:31:00  Next cron run. Watermark = 14:30:05.
          Query: WHERE main.updated_at > grid.updated_at
                   AND main.updated_at > '14:30:05'

          Order #5432 has updated_at = 14:30:01.
          14:30:01 > 14:30:05? NO. Filtered out. Skipped.

14:32:00  Next cron run. Same watermark. Same result. Skipped.

          ... every future cron run skips it too ...

Release note

When async grid indexing is enabled (dev/grid/async_indexing = 1), orders updated during a cron sync cycle could permanently show stale data in the admin Sales > Orders grid. The watermark optimization in UpdatedAtListProvider has been removed to ensure all out-of-sync orders are detected reliably.

Triage and priority

  • Severity: S0 - Affects critical data or functionality and leaves users without workaround.
  • Severity: S1 - Affects critical data or functionality and forces users to employ a workaround.
  • Severity: S2 - Affects non-critical data or functionality and forces users to employ a workaround.
  • Severity: S3 - Affects non-critical data or functionality and does not force users to employ a workaround.
  • Severity: S4 - Affects aesthetics, professional look and feel, “quality” or “usability”.

cc: @convenient

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    Status

    Ready for Confirmation

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
