I recently encountered a bug related to an MView log causing very high library cache: mutex X wait events.
I will briefly describe the debugging steps I tried and the fix for the same.
A few things to note before I proceed:
- We observed huge waits on library cache: mutex X whenever we flipped to the standby or bounced the DB, i.e. whenever the library cache was cold and didn't have the required cursor information and object handles.
- Load on the database was very high: a very large number of sessions connected to the database, and a high number of DMLs performed on the same table by many sessions.
- The table on which the DMLs are performed has an MLOG (materialized view log) created on it, based on the primary key. This is required because the changes need to flow to downstream databases via fast-refresh MViews. Around 10 downstream sites were registered on this MLOG.
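For context, a primary-key based MView log is created roughly as below. The master table name is inferred from the MLOG$_ table seen later in the insert statement; the exact creation options used in our environment are an assumption:

```sql
-- Sketch: create a primary-key based MView log on the master table
-- (table name inferred from MLOG$_FULFILLMENT_DEMAND_I; options are assumptions)
CREATE MATERIALIZED VIEW LOG ON booker.fulfillment_demand_i
  WITH PRIMARY KEY;
```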
Root Cause Analysis:
After looking at the wait events, we immediately started digging into the root cause. We wanted to understand the bottleneck causing these waits and remove whatever it took to eliminate them.
I started by checking the top wait events of the last 20 minutes from v$active_session_history:
```
SQL> select * from (
       select event, count(1)
       from v$active_session_history
       where sample_time > (sysdate - 20/1440)
       group by event
       order by 2 desc)
     where rownum < 10;

EVENT                                      COUNT(1)
---------------------------------------- ----------
library cache: mutex X                        50943
checkpoint completed                          15170
read by other session                          5487
row cache lock                                 4205
log file sync                                  3137
flashback buf free by RVWR                    1815
db file sequential read                        1675
log file switch completion                    1611
cursor: pin S wait on X                       1516

9 rows selected.
```
Clearly, library cache: mutex X was way higher than the rest.
We can identify what is causing library cache: mutex X by checking P1 and P2 of the wait event:
```
SQL> select event, p1, count(1)
     from v$active_session_history
     where sample_time > (sysdate - 20/1440)
     and event = 'library cache: mutex X'
     group by event, p1
     order by 3;

EVENT                                            P1   COUNT(1)
---------------------------------------- ---------- ----------
library cache: mutex X                    421399181          1
library cache: mutex X                   3842104349          1
library cache: mutex X                   3477606468          1
library cache: mutex X                   2432877226          1
library cache: mutex X                    955484820          2
library cache: mutex X                        37005         25
library cache: mutex X                   1412465886        297
library cache: mutex X                   2417922189      50615

8 rows selected.
```
As you can see, the vast majority of the waits have P1 = 2417922189.
P1 is "idn" and can be used to find the object protected by the mutex.
For library cache: mutex X:
- P1 represents the library cache hash bucket number (if idn <= 131072)
- P1 represents the hash value of the library cache object under protection (if idn > 131072)
In our case the value was 2417922189, which is greater than 131072, so it represents the hash value of a library cache object.
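The meaning of P1/P2/P3 for any wait event can be confirmed from v$event_name; a quick sketch:

```sql
-- Confirm what P1/P2/P3 mean for this wait event
SELECT name, parameter1, parameter2, parameter3
FROM   v$event_name
WHERE  name = 'library cache: mutex X';

-- For this event: parameter1 = "idn", parameter2 = "value", parameter3 = "where"
```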
We can check whether this hash value belongs to a cursor (SQL) using v$sql:
```
SQL> select sql_id from v$sql where hash_value = 2417922189;

SQL_ID
-------------
3b7aa6f81x44d

SQL> @sql
Enter SQL_ID:- 3b7aa6f81x44d
old   3: where sql_id = '&SQL_ID'
new   3: where sql_id = '3b7aa6f81x44d'

INSERT /*+ IDX(0) */ INTO "BOOKER"."MLOG$_FULFILLMENT_DEMAND_I"
(dmltype$$,old_new$$,snaptime$$,change_vector$$,xid$$,
"WAREHOUSE_ID","CO_WAREHOUSE_ITEM_ID")
VALUES (:d,:o,to_date('4000-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS'),
:c,:x,:1,:2)
```
As you can see, this is the DML on the MLOG table, so clearly the MLOG was the bottleneck.
The wait events were bringing everything to a standstill and nothing was getting processed. Since database availability was critical and we couldn't afford to lose any more time to these waits, our immediate action was to drop the MLOG.
But since the DB had thousands of sessions, all stuck on library cache: mutex X, we could not even get the lock needed to drop the MLOG (not even with ddl_lock_timeout). Killing sessions did not help as they kept coming back in no time.
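For reference, the attempt looked roughly like this (a sketch; the owner and table name are inferred from the insert statement above):

```sql
-- Wait up to 5 minutes for the DDL lock instead of failing immediately
ALTER SESSION SET ddl_lock_timeout = 300;

-- Drop the MView log on the master table; this kept timing out because
-- the sessions stuck on the mutex never released their locks
DROP MATERIALIZED VIEW LOG ON booker.fulfillment_demand_i;
```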
So we decided to take a quick bounce of the DB (keeping the listeners down), drop the MLOG, and start the listeners again.
This fixed the issue and the wait events were gone.
On the downside, all downstream sites had to do a complete refresh of the MView (followed by fast refreshes) because they got deregistered from the MLOG. But that was acceptable as the master table was not very big.
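Each downstream site re-registers by doing a one-time complete refresh, after which fast refresh works again. A sketch using the standard DBMS_MVIEW API (the MView name here is hypothetical):

```sql
-- One-time complete refresh ('C'), needed because the site was
-- deregistered from the MLOG when it was dropped and recreated
EXEC DBMS_MVIEW.REFRESH('fulfillment_demand_mv', method => 'C');

-- Subsequent refreshes can go back to fast refresh ('F')
EXEC DBMS_MVIEW.REFRESH('fulfillment_demand_mv', method => 'F');
```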
We also tried to investigate why an MLOG could cause this issue, but initially could not get any insight.
An MLOG is a table into which the primary keys of changed records in the master table are inserted; when a downstream site refreshes its data, those records are deleted. So we don't expect an MLOG to grow very large. Unfortunately, in this case it was 7.7 GB (bigger than the master table itself).
Because the MLOG was so large, whenever a downstream database performed a fast refresh it had to update the SNAPTIME$$ column in the MLOG table, which records which rows were refreshed by which downstream DB and when. This update might take a long time and hold enqueue locks. But the inserts should not be blocked by this update, since an insert creates a new record and does not collide at the enqueue level. Also, we were seeing these library cache wait events only just after a bounce or failover.
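In hindsight, keeping an eye on MLOG growth helps catch this early. A sketch for checking the segment size and purging rows already consumed by the least-recently-refreshed site, using the standard DBMS_MVIEW API:

```sql
-- Check how big the MView log segment has grown
SELECT segment_name, bytes/1024/1024/1024 AS size_gb
FROM   dba_segments
WHERE  segment_name = 'MLOG$_FULFILLMENT_DEMAND_I';

-- Purge rows no longer needed by the least-recently-refreshed site
EXEC DBMS_MVIEW.PURGE_LOG(master => 'BOOKER.FULFILLMENT_DEMAND_I', num => 1);
```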
This was identified as a bug, and Oracle provided the following patch as a fix for this issue:
Patch 20879889: INSERT INTO MV LOG LEAVING TOO MANY OPEN CURSORS AFTER UPGR TO 12.1.0.2
Make sure you have this patch applied if you are using MLOGs in your database.
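One way to verify the patch is in place (a sketch; on 12c and later, patches registered by datapatch appear in dba_registry_sqlpatch, and `opatch lsinventory` can be used at the OS level):

```sql
-- Check whether the fix has been applied and registered in the database
SELECT patch_id, action, status
FROM   dba_registry_sqlpatch
WHERE  patch_id = 20879889;
```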
Hope this helps !!