Adaptive Cursor Sharing

Adaptive cursor sharing was introduced by Oracle in 11g Release 1 as a way to generate the best plan for a SQL statement in all situations. Prior to adaptive cursor sharing, the optimizer generated a single plan for a SQL statement, and that plan was used by all cursors of that SQL_ID. So if we have data skewness in a column, and that column is used in the WHERE clause of the SQL statement, a single plan for that SQL will not be optimal for every value that is passed in the bind variable.

Let's take a simple example: we will create a table and add random data to it.

SQL> create table T1
 2 (emp_id number,
 3 dept_id number,
 4 salary number,
 5 gender char(1),
 6 dob date,
 7 address varchar2(60));

Table created.

Let's add ~10K records to this table (the CONNECT BY below generates 9999 rows):

SQL> insert into T1
     SELECT LEVEL,
            (MOD (ROWNUM, 2)+1)*10,
            TRUNC (DBMS_RANDOM.VALUE (1000, 500000), 2),
            DECODE (ROUND (DBMS_RANDOM.VALUE (1, 2)), 1, 'M', 2, 'F'),
            TO_DATE ( ROUND (DBMS_RANDOM.VALUE (1, 28))
                     || '-'
                     || ROUND (DBMS_RANDOM.VALUE (1, 12))
                     || '-'
                     || ROUND (DBMS_RANDOM.VALUE (1900, 2010)),
                        'DD-MM-YYYY'
                     ),
            DBMS_RANDOM.STRING ('x', DBMS_RANDOM.VALUE (20, 50))
     FROM DUAL
     CONNECT BY LEVEL < 10000;

9999 rows created.

We can make one of the columns, DEPT_ID, skewed by updating the records as below:

SQL> update T1 set dept_id = 10 where rownum <= 9950;

9950 rows updated.

SQL> update T1 set dept_id = 20 where dept_id != 10;

49 rows updated.

SQL> select dept_id, count(1) from T1 group by dept_id;

   DEPT_ID COUNT(1)
---------- ----------
        20 49
        10 9950

So we have skewness in column DEPT_ID now.
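The execution plans later in this article use an index I_T1_DEPT_ID on the DEPT_ID column; its creation is not shown in the original steps, so, assuming the index name from those plans, it would be created like this:

SQL> create index I_T1_DEPT_ID on T1(dept_id);

Index created.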

Let's gather statistics on this table with the CASCADE option so that statistics are gathered on the index as well. Since the data in the indexed column is skewed, it will also generate a histogram.


SQL> exec dbms_stats.gather_table_stats(OWNNAME=>user, TABNAME=>'T1',CASCADE => TRUE , METHOD_OPT => 'FOR ALL INDEXED COLUMNS SIZE 254');

PL/SQL procedure successfully completed.

SQL> select * from dba_histograms where TABLE_NAME = 'T1';

OWNER                          TABLE_NAME                     COLUMN_NAME                    ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU
------------------------------ ------------------------------ ------------------------------ --------------- -------------- --------------------
ADVAITD_DBA                    T1                             DEPT_ID                                   9950             10
ADVAITD_DBA                    T1                             DEPT_ID                                   9999             20

SQL> select table_name, column_name, histogram from user_tab_cols where table_name = 'T1' and histogram != 'NONE';

TABLE_NAME                     COLUMN_NAME                    HISTOGRAM
------------------------------ ------------------------------ ---------------
T1                             DEPT_ID                        FREQUENCY

We can see it has created a frequency histogram on the indexed column DEPT_ID.

Adaptive cursor sharing is affected by multiple factors. Following are some of the important factors which affect adaptive cursor sharing:

  • Whether adaptive cursor sharing is enabled. This is the first, most basic prerequisite.
  • The CURSOR_SHARING parameter and what value we set for it.
  • SQL plan baselines. Whether baselines are enabled, and which baselines are available for a SQL with ENABLED=YES and ACCEPTED=YES.
  • OPTIMIZER_INDEX_COST_ADJ, and what value this parameter is set to in the database.

You can check if adaptive cursor sharing is enabled by checking the hidden parameter _optimizer_adaptive_cursor_sharing.

SQL>@param
Enter Parameter name: _adaptive_cursor_sharing

Parameter                            |Description                                                 |Session Va|Instance V
-------------------------------------|------------------------------------------------------------|----------|----------
_optimizer_adaptive_cursor_sharing   |optimizer adaptive cursor sharing                           |TRUE      |TRUE
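The @param script used above is a custom helper. A minimal sketch of an equivalent query against the hidden parameter views (run as SYS; the standard x$ksppi/x$ksppcv layout is assumed) could look like this:

select i.ksppinm parameter, c.ksppstvl value
from   x$ksppi i, x$ksppcv c
where  i.indx = c.indx
and    i.ksppinm like '%adaptive_cursor_sharing%';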

Now that we have adaptive cursor sharing enabled, let's see how it works and the effect of the other factors on it.

How does adaptive cursor sharing work?

So we have a table T1 with ~10K records, and one of its columns, DEPT_ID, has skewed data as seen above: 9950 records with DEPT_ID = 10 and 49 records with DEPT_ID = 20.

Let's run some queries and check how the optimizer behaves.
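Note that the transcripts below assume the bind variable has already been declared in the SQL*Plus session; that step is not shown here:

SQL> variable dept_no number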

Running SQL for DEPT_NO = 20


SQL>exec :dept_no :=20;
SQL>select /*+ GATHER_PLAN_STATISTICS */ count(1), max(dob) from t1 where dept_id = :dept_no;

  COUNT(1)|MAX(DOB)
----------|----------------
        49|2002-09-06:00:00

SQL>select sql_id, child_number, plan_hash_value, executions, buffer_gets, is_bind_sensitive, is_bind_aware, is_shareable from v$sql where sql_text like '%GATHER_PLAN_STATISTICS%';

SQL_ID       |CHILD_NUMBER|PLAN_HASH_VALUE|EXECUTIONS|BUFFER_GETS|I|I|I
-------------|------------|---------------|----------|-----------|-|-|-
btj8b1hx3cf0g|           0|     3776485839|         1|         58|Y|N|Y

Pay attention to the columns IS_BIND_SENSITIVE, IS_BIND_AWARE and IS_SHAREABLE. IS_BIND_SENSITIVE tells whether the query is sensitive to the values of the bind variable; that is, if we change the bind value, might the optimal plan change significantly?

In the above case, the optimizer shows IS_BIND_SENSITIVE=Y based on the histogram present on the column used in the WHERE clause.

As per the Oracle definition: IS_BIND_SENSITIVE VARCHAR2(1) Indicates whether the cursor is bind sensitive (Y) or not (N). A query is considered bind-sensitive if the optimizer peeked at one of its bind variable values when computing predicate selectivities and a change in a bind variable value may cause the optimizer to generate a different plan.

The plan it is currently using is an index range scan:

select /*+ GATHER_PLAN_STATISTICS */ count(:"SYS_B_0"), max(dob) from
t1 where dept_id = :dept_no

Plan hash value: 3776485839

---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |       |       |     1 (100)|          |
|   1 |  SORT AGGREGATE              |              |     1 |    12 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1           |    49 |   588 |     1   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | I_T1_DEPT_ID |    49 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("DEPT_ID"=:DEPT_NO)

Let's run the query again, but this time with the bind value 10, which matches 9950 records.

Running SQL for DEPT_NO = 10


SQL>exec :dept_no := 10;
SQL>select /*+ GATHER_PLAN_STATISTICS */ count(1), max(dob) from t1 where dept_id = :dept_no;

  COUNT(1)|MAX(DOB)
----------|----------------
      9950|2010-12-27:00:00

SQL>select sql_id, child_number, plan_hash_value , executions, buffer_gets, is_bind_sensitive, is_bind_aware,IS_SHAREABLE from v$sql where sql_text like '%GATHER_PLAN_STATISTICS%';

SQL_ID       |CHILD_NUMBER|PLAN_HASH_VALUE|EXECUTIONS|BUFFER_GETS|I|I|I
-------------|------------|---------------|----------|-----------|-|-|-
btj8b1hx3cf0g|           0|     3776485839|         2|        169|Y|N|Y

You can see we now have 2 executions but the same cursor is being used. During the second run we used a different bind value, and the optimizer became aware that this bind value actually fetches a different number of records and that the current plan is expensive for it.

So even though it didn't generate a new cursor for this run and used the same plan as earlier, it will generate a new cursor when we run this again for DEPT_NO = 10, and hopefully it will also use a new plan (probably a full table scan).

Let's run this query again for the same DEPT_NO = 10.

Running SQL for DEPT_NO = 10


SQL>exec :dept_no := 10;
SQL>select /*+ GATHER_PLAN_STATISTICS */ count(1), max(dob) from t1 where dept_id = :dept_no;

  COUNT(1)|MAX(DOB)
----------|----------------
      9950|2010-12-27:00:00

SQL>select sql_id, child_number, plan_hash_value , executions, buffer_gets, is_bind_sensitive, is_bind_aware,IS_SHAREABLE from v$sql where sql_text like '%GATHER_PLAN_STATISTICS%';

SQL_ID       |CHILD_NUMBER|PLAN_HASH_VALUE|EXECUTIONS|BUFFER_GETS|I|I|I
-------------|------------|---------------|----------|-----------|-|-|-
btj8b1hx3cf0g|           0|     3776485839|         2|        169|Y|N|N
btj8b1hx3cf0g|           1|     3724264953|         1|         96|Y|Y|Y

Now you can see that it has created a new cursor along with a new plan, and IS_BIND_AWARE becomes Y for the new cursor. At the same time, cursor # 0 becomes non-shareable (the IS_SHAREABLE column shows 'N'). Since the optimizer knows this query is bind sensitive, it creates new cursors which are bind aware and stops using the cursor it created initially, which was not bind aware.

As per the Oracle definition: IS_BIND_AWARE VARCHAR2(1) Indicates whether the cursor is bind aware (Y) or not (N). A query is considered bind-aware if it has been marked to use extended cursor sharing. The query would already have been marked as bind-sensitive.

Sometimes the optimizer just creates a new cursor with the same plan, and it might go for an index range scan even when selecting the majority of rows from the table. This happened to me when I was trying this example. In such cases, the cost of the plan is influenced by the OPTIMIZER_INDEX_COST_ADJ parameter. In my case, this parameter was set to the lowest value of 1, so the optimizer was going for an index range scan in all cases. A lower value of this parameter represents a lower cost for index access and so favors the use of indexes. I set the value of this parameter to 50, and after that the optimizer went for a full table scan.
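For reference, this is how the parameter can be inspected and changed at session level; treat this as a diagnostic sketch, since lowering or raising it shifts the costing of every index access path:

SQL> show parameter optimizer_index_cost_adj
SQL> alter session set optimizer_index_cost_adj = 50;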

So if you check the plan for cursor 1, it shows a full table scan, as we are selecting over 99% of rows from the table (DEPT_NO = 10 fetches 9950 of the 9999 rows).

SQL>select * from table(dbms_xplan.display_cursor('btj8b1hx3cf0g',1));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  btj8b1hx3cf0g, child number 1
-------------------------------------
select /*+ GATHER_PLAN_STATISTICS */ count(:"SYS_B_0"), max(dob) from
t1 where dept_id = :dept_no

Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |    27 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |    12 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |  9950 |   116K|    27   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DEPT_ID"=:DEPT_NO)

At this point, if we go for another execution with DEPT_NO = 20, it will create another cursor, which will also be bind aware.

Running SQL for DEPT_NO = 20

SQL>exec :dept_no := 20;
SQL>select /*+ GATHER_PLAN_STATISTICS */ count(1), max(dob) from t1 where dept_id = :dept_no;

  COUNT(1)|MAX(DOB)
----------|----------------
        49|2002-09-06:00:00

SQL>select sql_id, child_number, plan_hash_value , executions, buffer_gets, is_bind_sensitive, is_bind_aware,IS_SHAREABLE from v$sql where sql_text like '%GATHER_PLAN_STATISTICS%';

SQL_ID       |CHILD_NUMBER|PLAN_HASH_VALUE|EXECUTIONS|BUFFER_GETS|I|I|I
-------------|------------|---------------|----------|-----------|-|-|-
btj8b1hx3cf0g|           0|     3776485839|         2|        169|Y|N|N
btj8b1hx3cf0g|           1|     3724264953|         1|         96|Y|Y|Y
btj8b1hx3cf0g|           2|     3776485839|         1|          3|Y|Y|Y

As we can see, it created cursor 2, which is bind aware. So now cursor 0 is not shareable and will be aged out of the shared pool in time, but cursors 1 and 2 are shareable and will be used by further executions.
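If you want to see the bookkeeping behind these decisions, the V$SQL_CS_* views expose the adaptive cursor sharing metadata. As a sketch, using the SQL_ID from this example, the following shows the selectivity range that each bind-aware child cursor accepts:

select child_number, predicate, low, high
from   v$sql_cs_selectivity
where  sql_id = 'btj8b1hx3cf0g';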

Effect of bind peeking on adaptive cursor sharing

One of the important prerequisites for adaptive cursor sharing is that bind peeking must be enabled in the database. Without bind peeking, adaptive cursor sharing will not work.

Initially, in my environment, bind peeking was not enabled, and because of that the optimizer had no visibility into the values being passed to the bind variables. In that case the optimizer will not generate new cursors even when adaptive cursor sharing is enabled. The optimizer does not even mark the cursor as bind sensitive, as it has no visibility into the bind values.

As you can see below, it was using the same cursor for multiple executions.


SQL>select sql_id, child_number, executions, buffer_gets, is_bind_sensitive, is_bind_aware from v$sql where sql_text like '%test_sql%';

SQL_ID       |CHILD_NUMBER|EXECUTIONS|BUFFER_GETS|I|I
-------------|------------|----------|-----------|-|-
2bxcmkhms19jn|           0|        12|        220|N|N

If we check bind peeking, it was turned off:

SQL>@param
Enter Parameter name: peek

Parameter                            |Description                                                 |Session Va|Instance V
-------------------------------------|------------------------------------------------------------|----------|----------
_optim_peek_user_binds               |enable peeking of user binds                                |FALSE     |FALSE
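If bind peeking has been disabled like this, it can be switched back on by resetting the hidden parameter. This is an instance-wide optimizer change, so take this as a sketch to be validated in your environment first:

alter system set "_optim_peek_user_binds" = true scope=both;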

Effect of CURSOR_SHARING parameter on adaptive cursor sharing

Cursor sharing plays an important role when we are using adaptive cursor sharing. The CURSOR_SHARING parameter determines whether a cursor can be shared or a new cursor should be created. This is discussed very well at https://blogs.oracle.com/optimizer/entry/explain_adaptive_cursor_sharing_behavior_with_cursor_sharing_similar_and_force

Here is a brief overview of how CURSOR_SHARING plays a role.
Up to 11g R2, we had 3 values for the CURSOR_SHARING parameter:

  • EXACT: The optimizer does not replace the literals used in the statement with system-generated bind variables. So if you are using literals, each distinct statement text is treated as a different statement, depending on the literal values you have used.
  • SIMILAR: The optimizer replaces the literals with system-generated bind variables, but if a histogram exists on the column the bind is treated as unsafe and each distinct literal value gets its own child cursor; effectively the statements are not shared.
  • FORCE: Replaces all literals with system-generated bind variables irrespective of whether a histogram is present on the column or not.

Example: we have following statement with literal

select /*+ GATHER_PLAN_STATISTICS */ count(1), max(dob) from t1 where dept_id = 10;

This is the same statement that we used above, and the optimizer knows that the DEPT_ID column is skewed and has a histogram defined on it. Because of the histogram, the optimizer knows the data is skewed, and that sharing a single cursor across all literal values can lead to a suboptimal plan.

So if we use CURSOR_SHARING = SIMILAR and we have statements like the above, using literal values on columns with histograms, the optimizer will not share cursors across the literal values; each distinct value gets its own child cursor.

In such cases adaptive cursor sharing does not kick in, since each cursor already corresponds to a single value.

For adaptive cursor sharing to work in such a situation, we would need to drop the histograms on the columns where we use literal values. This has its own problems, as data skewness can then no longer be determined by the optimizer.

This is one of the main reasons why the value SIMILAR is deprecated for the CURSOR_SHARING parameter in Oracle 12c. Oracle encourages using only FORCE and EXACT going forward.
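For reference, the current setting can be checked and changed dynamically; a minimal example:

SQL> show parameter cursor_sharing
SQL> alter system set cursor_sharing = 'FORCE' scope=both;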

Effect of baseline on adaptive cursor sharing

This is covered in good detail at https://blogs.oracle.com/optimizer/entry/how_do_adaptive_cursor_sharing. I am covering this topic only briefly for the sake of completeness.

Baselines have a different purpose than adaptive cursor sharing. While adaptive cursor sharing deals with using the right cursor and plan for a SQL statement depending on the bind values we pass and the skewness of the data, the purpose of a baseline is to provide stability and avoid using any new, unverified plan.

Adaptive cursor sharing kicks in before SQL plan management. So when the statement is executed, adaptive cursor sharing will peek at the bind values and check the skewness of the data. Based on these inputs, it will check whether one of the existing cursors and plans can be used, or whether a new, better plan can be derived. Adaptive cursor sharing can tell the optimizer that a new plan or cursor needs to be created, and after that control moves on to the next part of the optimizer. So adaptive cursor sharing only suggests that a new cursor or plan is better and required, and provides the supporting data.

Now, if we already have a SQL plan baseline enabled for this SQL, it will prevent the optimizer from using a new plan. The optimizer can still create the new plan and save it in the SQL management base, but this new plan will be ENABLED and NOT ACCEPTED. Either the DBA has to manually accept the plan, or we have to wait for the SQL tuning advisor auto task to kick in and accept the baseline if it is verified as good. Only after the new plan is accepted will it be used for the SQL.
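Manually accepting such an unaccepted plan can be done with DBMS_SPM. Below is a sketch using the SQL handle from this example; verify=>'NO' skips performance verification, so use it only when you already trust the plan:

set serveroutput on
declare
  l_report clob;
begin
  l_report := dbms_spm.evolve_sql_plan_baseline(
                sql_handle => 'SQL_fe3bdbdf0b00f126',
                verify     => 'NO',
                commit     => 'YES');
  dbms_output.put_line(l_report);
end;
/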

In my case, I had a baseline created for the SQL initially.


SQL>exec :dept_no :=20;
SQL>select /*+ GATHER_PLAN_STATISTICS */ count(1), max(dob) from t1 where dept_id = :dept_no;

  COUNT(1)|MAX(DOB)
----------|----------------
        49|2002-09-06:00:00


If you check the explain plan for this SQL

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  btj8b1hx3cf0g, child number 0
-------------------------------------
select /*+ GATHER_PLAN_STATISTICS */ count(:"SYS_B_0"), max(dob) from
t1 where dept_id = :dept_no

Plan hash value: 3776485839

---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |       |       |     1 (100)|          |
|   1 |  SORT AGGREGATE              |              |     1 |    12 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1           |    49 |   588 |     1   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | I_T1_DEPT_ID |    49 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("DEPT_ID"=:DEPT_NO)

Note
-----
   - SQL plan baseline SQL_PLAN_gwfyvvw5h1w96e68c0740 used for this statement

We can see the baseline being used in the Note section. We can also see the details of the baseline in the DBA_SQL_PLAN_BASELINES view.


SQL>select sql_handle, plan_name from dba_sql_plan_baselines where signature = (select exact_matching_signature from v$sql where sql_id = 'btj8b1hx3cf0g');

SQL_HANDLE                    |PLAN_NAME
------------------------------|------------------------------
SQL_fe3bdbdf0b00f126          |SQL_PLAN_gwfyvvw5h1w96e68c0740

Even after running the SQL multiple times with different bind values, we do not get any new cursor. Also, the cursor is not marked as bind sensitive, as shown in the output below.


SQL>select sql_id, child_number, plan_hash_value , executions, buffer_gets, is_bind_sensitive, is_bind_aware,IS_SHAREABLE from v$sql where sql_text like '%GATHER_PLAN_STATISTICS%';

SQL_ID       |CHILD_NUMBER|PLAN_HASH_VALUE|EXECUTIONS|BUFFER_GETS|I|I|I
-------------|------------|---------------|----------|-----------|-|-|-
btj8b1hx3cf0g|           0|     3776485839|         5|        339|N|N|Y

So if we have a SQL plan baseline enabled for a SQL and only one baseline is ACCEPTED, we will not see the advantages of adaptive cursor sharing; we need multiple baselines ACCEPTED so that the optimizer can use a different plan as suggested by adaptive cursor sharing.

This covers most of the situations for using adaptive cursor sharing.

Hope this helps !!

Reference

https://blogs.oracle.com/optimizer/entry/how_do_adaptive_cursor_sharing

https://blogs.oracle.com/optimizer/entry/update_on_adaptive_cursor_sharing

https://docs.oracle.com/database/121/REFRN/GUID-455358F8-D657-49A2-B32B-13A1DC53E7D2.htm#REFRN10025

https://blogs.oracle.com/optimizer/entry/explain_adaptive_cursor_sharing_behavior_with_cursor_sharing_similar_and_force

Monitoring transaction recovery

Sometimes we end up in a situation where a long-running transaction is not completing and we are not sure how much more time it is going to take. This happened with one of our DBAs, who found an MLOG bloated because of one orphan snapshot entry. Orphan entries are the ones where the actual site is not registered on the master (no entry in DBA_REGISTERED_SNAPSHOTS) but an entry for the MLOG exists (entry in DBA_SNAPSHOT_LOGS). This can happen if we drop a snapshot from the downstream database and it does not get cleaned up on the upstream database.
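As a sketch (assuming the documented columns of these two dictionary views), orphan entries can be listed by finding snapshot IDs that appear against the log but are not registered:

select l.log_owner, l.master, l.snapshot_id
from   dba_snapshot_logs l
where  l.snapshot_id not in
      (select r.snapshot_id from dba_registered_snapshots r);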

In the situation I faced, the upstream team had an MLOG which was bloated to 18GB, and the MLOG also had an index which was bloated to 30GB (ya, I know it's bad :-)).

So they identified the orphan snapshot ID and wanted to purge it from the snapshot log to reduce the size of the MLOG (after moving the MLOG and rebuilding the index once the purge was done).

They used the following procedure of DBMS_SNAPSHOT for purging the snapshot ID from the log:

PROCEDURE PURGE_SNAPSHOT_FROM_LOG
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 SNAPSHOT_ID                    BINARY_INTEGER          IN
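Usage is a single call; the snapshot ID below is just a placeholder for the orphan ID that was identified:

SQL> exec dbms_snapshot.purge_snapshot_from_log(12345);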

They started the activity in the morning, and after monitoring it until the evening it was still not complete. I helped them track the progress by checking the real-time SQL monitoring report, which showed that the session had already read around 60GB and that the undo used by that time was around 48GB. It was not clear how the command had read 60GB worth of data when the MLOG size was only 18GB.

Also, original base table was just 2GB.

At this point they wanted to kill the session. But killing the session would not help immediately, as it would have to perform a huge rollback as well (48GB of undo).

Since the command was not completing and had taken almost an entire shift, they decided to kill the session. The session was killed using "ALTER SYSTEM KILL SESSION '<sid>,<serial#>' IMMEDIATE" and was marked for kill. But the session was only marked as killed and was still holding its locks (as seen in the V$LOCK view). This was because the session was performing the rollback. We can monitor the progress of the rollback using the V$TRANSACTION view.

You can look at USED_UBLK in V$TRANSACTION to estimate how long it is going to take to complete the rollback.

SQL> SELECT a.used_ublk 
 FROM v$transaction a, v$session b
 WHERE a.addr = b.taddr AND b.sid = <SID>;

For example:

If USED_UBLK showed 29,900 twelve hours ago and is now 22,900, it has taken 12 hours to roll back 7,000 undo blocks (about 583 blocks per hour). At that rate, the remaining 22,900 blocks will take approximately another 39 hours to complete, depending on the types of transactions that are rolling back.

Recovery was very slow because the session was doing serial recovery. Next we found the OS PID of the session and killed the OS process as well, so that recovery could happen in the background via SMON. Within a few minutes PMON performed the cleanup and the locks were released.

Rollback continued in the background, and this is faster than the rollback performed by the session. If we kill the session and its shadow process at the OS level, SMON picks up the rollback and goes for parallel rollback, which is faster.

V$FAST_START_TRANSACTIONS & X$KTUXE

We can monitor the progress of rollback in V$FAST_START_TRANSACTIONS view.

V$FAST_START_TRANSACTIONS contains one row for each transaction that Oracle is recovering in parallel.

FAST_START_PARALLEL_ROLLBACK shows the maximum number of processes which may exist for performing parallel rollback.

In fast-start parallel rollback, the background process SMON acts as a coordinator and rolls back a set of transactions in parallel using multiple server processes.

Fast-start parallel rollback is mainly useful when a system has transactions that run a long time before committing, especially parallel INSERT, UPDATE and DELETE operations. When SMON discovers that the amount of recovery work is above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes.

The following queries are available to monitor the progress of the transaction recovery

set linesize 100
alter session set NLS_DATE_FORMAT='DD-MON-YYYY HH24:MI:SS';
select usn, state, undoblockstotal "Total", undoblocksdone "Done", undoblockstotal-undoblocksdone "ToDo",
       decode(cputime,0,'unknown',sysdate+(((undoblockstotal-undoblocksdone) / (undoblocksdone / cputime)) / 86400)) "Estimated time to complete"
from v$fast_start_transactions;

Run the above query several times in a row; this will give you a good idea of how SMON is progressing.

  • In some versions the cputime does not work (always 0), hence the estimated completion time will not be displayed
  • In some cases the v$fast_start_transactions view will not work. If this is the case then you can query the internal data dictionary view x$ktuxe

The ‘ktuxesiz’ column represents the remaining number of undo blocks required for rollback:

select ktuxeusn, to_char(sysdate,'DD-MON-YYYY HH24:MI:SS') "Time", ktuxesiz, ktuxesta 
 from x$ktuxe 
 where ktuxecfl = 'DEAD';

I was not able to see recovery progress using V$FAST_START_TRANSACTIONS, but I was able to see the progress in the x$ktuxe view.

select ktuxeusn, to_char(sysdate,'DD-MON-YYYY HH24:MI:SS') "Time", ktuxesiz, ktuxesta 
from x$ktuxe 
where ktuxecfl = 'DEAD'; 

  KTUXEUSN|Time                      |  KTUXESIZ|KTUXESTA
----------|--------------------------|----------|----------------
      2167|01-AUG-2016 12:05:14      |   5260156|ACTIVE
SQL>/

  KTUXEUSN|Time                      |  KTUXESIZ|KTUXESTA
----------|--------------------------|----------|----------------
      2167|01-AUG-2016 12:05:15      |   5259945|ACTIVE
SRW1NA>/

  KTUXEUSN|Time                      |  KTUXESIZ|KTUXESTA
----------|--------------------------|----------|----------------
      2167|01-AUG-2016 12:05:15      |   5259854|ACTIVE

..
..
..
<After 2-3 hours>

  KTUXEUSN|Time                      |  KTUXESIZ|KTUXESTA
----------|--------------------------|----------|----------------
      2167|01-AUG-2016 16:31:47      |    612697|ACTIVE

Speeding up recovery

We can further improve the speed of recovery by taking following steps

1) There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the PQ slaves interfere with each other. To check the parallel recovery processes and their state, query:

select * from v$fast_start_servers;

The STATE column shows whether a server is IDLE or RECOVERING. If only one process is in state RECOVERING while the other processes are IDLE, then you should disable parallel transaction recovery. How to do this is outlined in the following note:

Note 238507.1: How to Disable Parallel Transaction Recovery When Parallel Txn Recovery is Active

2) If all the rows show RECOVERING in the STATE column of v$fast_start_servers, then you will benefit from adding more threads for doing the recovery.

You can do so by setting the FAST_START_PARALLEL_ROLLBACK parameter. Set it to HIGH if you want to speed up the recovery, as shown after the list below.

Following are the different values of this parameter

  • FALSE – Parallel rollback is disabled
  • LOW – Limits the maximum degree of parallelism to 2 * CPU_COUNT
  • HIGH -Limits the maximum degree of parallelism to 4 * CPU_COUNT
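A minimal example of changing it (per the Oracle documentation the parameter is modifiable with ALTER SYSTEM):

alter system set fast_start_parallel_rollback = 'HIGH';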

Note that if you change the value of this parameter while transaction recovery is in progress, recovery will be stopped and restarted with the new implied degree of parallelism. So if you are already more than halfway through the rollback, it may not be worth changing; if you still change it, recovery will start from the beginning again.

3) Increase the parameter ‘_cleanup_rollback_entries’

This parameter determines the number of undo entries to apply per transaction cleanup. The default value is 100; you can change it to, say, 400. This parameter cannot be changed dynamically, so the database will need to be restarted for the change to take effect.

In our specific situation, we knew that a huge rollback needed to be performed and we were monitoring the rollback progress from the beginning. So we made a decision at the very start to set FAST_START_PARALLEL_ROLLBACK to HIGH and bounce the DB. This improved recovery speed right from the beginning.

References:

Troubleshooting Database Transaction Recovery (Doc ID 1494886.1)

Database Hangs Because SMON Is Taking 100% CPU Doing Transaction Recovery (Doc ID 414242.1)

SMON: Parallel transaction recovery tried (Doc ID 1458738.1)

Note 238507.1: How to Disable Parallel Transaction Recovery When Parallel Txn Recovery is Active

Fixing SQL plans on ADG using SQL Profiles

With an Active Data Guard (ADG) setup, many read-only applications run on ADG, and sometimes the SQLs used by these applications run into bad plans. ADG being a read-only database, it is not possible to create a baseline or a profile on ADG.

One way to get a good plan for SQLs running on ADG is to make them run on the primary first, fix the plan by creating a profile or baseline, and then move these SQLs back to ADG. Practically, I have seen many times that dev teams are not very flexible in changing the configuration to run the queries on the primary. Either their configuration setup is complex and takes time and effort to change, or, most of the time in bigger enterprises, they have a central configuration file which, when changed, changes the configuration for the entire region. In that case the complete software stack will run on the primary for the entire region, not just one database.

Another way to make queries run on the primary is to point the ADG CNAME to the primary. That way all the applications which are supposed to run on ADG will start running on the primary. The downside is the increased load on the primary, as all the ADG applications will connect to it. It is too much of a change to fix one SQL on ADG.

This article is about fixing plans on ADG using a profile, without making any other change. We don't have to run the query on the primary to fix the plan.

Fixing SQL plan on ADG

In one of my old articles, "Fixing SQL Plans: The hard way – Part 1", I described the DBMS_SQLTUNE.IMPORT_SQL_PROFILE procedure. This procedure takes the internal SQL hints that the optimizer understands as input and creates a profile to fix the plan.

In that method, I provide the SQL text as input, but I obtain that SQL text by selecting the SQL_TEXT column from either the V$SQLTEXT view or the DBA_HIST_SQLTEXT view.

Following is partial code of the PL/SQL block I used to create the profile:

for sql_rec in (
 select t.sql_id, t.sql_text
 from dba_hist_sqltext t, dba_hist_sql_plan p
 where t.sql_id = p.sql_id
 and p.sql_id = '&SQL_ID_TO_FIX'
 and p.plan_hash_value = &BAD_PLAN_HASH_VALUE
 and p.parent_id is null
)
loop
 DBMS_SQLTUNE.IMPORT_SQL_PROFILE(
 sql_text => sql_rec.sql_text,
 profile => ar_profile_hints,
 name => 'PROFILE_33fndgzsas09k');

end loop;

Why can't we use this method for ADG?

You will find the required SQL text in the V$SQLTEXT view on ADG, but you cannot create a SQL profile on ADG, as it is a read-only database.
Also, you cannot create the profile on the primary, because you will not find the SQL text on the primary.

So, how do we fix this?

Method 1:

One method I thought of for fixing the SQL is creating a database link between the primary and ADG (say adg_db_link) and using that DB link to get the SQL text from the standby. The PL/SQL code above then looks like below. Note that I have replaced the DBA_HIST* tables with V$ views, because the ADG data is present only in the V$ views; the DBA_HIST* tables have data from the primary database only, captured as part of AWR snapshots.

for sql_rec in (
 select t.sql_id, t.sql_text
 from v$sqltext@adg_db_link t, v$sql_plan@adg_db_link p
 where t.sql_id = p.sql_id
 and p.sql_id = '&SQL_ID_TO_FIX'
 and p.plan_hash_value = &BAD_PLAN_HASH_VALUE
 and p.parent_id is null
)
loop
 DBMS_SQLTUNE.IMPORT_SQL_PROFILE(
 sql_text => sql_rec.sql_text,
 profile => ar_profile_hints,
 name => 'PROFILE_33fndgzsas09k');

end loop;

You can run the complete PL/SQL block on the primary; the DB link will fetch the required SQL text from ADG, and you can create the SQL profile on the primary.

Once this SQL profile is created, it should be picked up by the SQL query on ADG. Note that you need to purge the SQL from the shared pool of ADG so that the SQL goes for a hard parse and picks up the profile.

Method 2:

Another method we can use is to pass the SQL text directly as input to the DBMS_SQLTUNE.IMPORT_SQL_PROFILE procedure. This is the method I implemented, and it worked for me.

Let's take an example.

Following is the SQL text that I want to fix. I was working with a completely different SQL in my environment, but I cannot share its actual text. It was very big and complex, so I simplified it and changed the table and column names for easier understanding.

select * from ( 
          select this_.col1 as col1,
                 this_.col2 as col2, 
                 this_.col3 as col3, 
                 this_.col4 as col4
          from TAB1 this_ 
          inner join TAB2 compaddres3_ on this_.col2 = compaddres3_.col2 
          inner join TAB3 compshipme1_ on this_.col1=compshipme1_.col1
          where this_.col1 in (:1) 
          and   this_.col2 in (:2 , :3) 
          and   compshipme1_.col1 in (:4 )
          and   this_.col8 = 'OPEN'
          and   this_.col5>=:5 
          and   this_.col5<=:6 
          order by this_.col1 asc ) 
where rownum <= :7

Assuming that the existing plan in the ADG database is bad, we need to find a good plan for this SQL and create a profile so that the new, good plan is picked up on ADG.
We are going to do that without running this SQL on the primary and without pointing the ADG CNAME to the primary. Also, we are not going to make any code changes to add hints.

Following are the steps to fix the plan on ADG

1) Get a good plan for this SQL

You can add hints to the SQL text and do an explain plan on the primary to check whether the plan looks good. Alternatively, if you want to be sure, you can run the hinted SQL on ADG to make sure the SQL runs efficiently.

I used a few index hints in my SQL to make sure the correct index gets picked, and then ran "explain plan" for the SQL to check the plan.

explain plan for
select * from ( 
          select /*+ index(tab1 I_UNIQ_COL2) */ this_.col1 as col1,
                 this_.col2 as col2, 
                 this_.col3 as col3, 
                 this_.col4 as col4
          from TAB1 this_ 
          inner join TAB2 compaddres3_ on this_.col2 = compaddres3_.col2 
          inner join TAB3 compshipme1_ on this_.col1=compshipme1_.col1
          where this_.col1 in (:1) 
          and   this_.col2 in (:2 , :3) 
          and   compshipme1_.col1 in (:4 )
          and   this_.col8 = 'OPEN'
          and   this_.col5>=:5 
          and   this_.col5<=:6 
          order by this_.col1 asc ) 
where rownum <= :7

I got the following plan after using the required hints:

Plan hash value: 2524091007

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name         | Rows | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------------------------
| 0   | SELECT STATEMENT                          |              | 134  | 1524K | 4134   (3) | 00:00:50 |       |       |
|* 1  |  COUNT STOPKEY                            |              |      |       |            |          |       |       |
| 2   |   VIEW                                    |              | 134  | 1524K | 4134   (3) | 00:00:50 |       |       |
|* 3  |    SORT ORDER BY STOPKEY                  |              | 134  | 71154 | 4134   (3) | 00:00:50 |       |       |
|* 4  |     FILTER                                |              |      |       |            |          |       |       |
| 5   |      NESTED LOOPS                         |              | 134  | 71154 | 4133   (3) | 00:00:50 |       |       |
| 6   |       NESTED LOOPS                        |              | 134  | 43550 | 3731   (3) | 00:00:45 |       |       |
| 7   |        TABLE ACCESS BY INDEX ROWID        | TAB1         | 1    | 22    | 1      (0) | 00:00:01 |       |       |
|* 8  |         INDEX UNIQUE SCAN                 | I_UNIQ_COL2  | 1    |       | 0      (0) | 00:00:01 |       |       |
| 9   |        TABLE ACCESS BY GLOBAL INDEX ROWID | TAB2         | 134  | 40602 | 3730   (3) | 00:00:45 | ROWID | ROWID |
|* 10 |         INDEX RANGE SCAN                  | I_TAB2_COL2  | 242  |       | 3507   (3) | 00:00:43 |       |       |
| 11  |        TABLE ACCESS BY GLOBAL INDEX ROWID | TAB3         | 1    | 206   | 3      (0) | 00:00:01 | ROWID | ROWID |
|* 12 |         INDEX RANGE SCAN                  | PK_TAB3_COL1 | 1    |       | 2      (0) | 00:00:01 |       |       |
-------------------------------------------------------------------------------------------------------------------------

2) Obtain outline hints for SQL

Once you have done the explain plan for the SQL, you can use FORMAT=>'ADVANCED' to get the necessary outline hints:

select * from table(dbms_xplan.display(format=>'advanced'));

/*+
 BEGIN_OUTLINE_DATA
 USE_NL(@"SEL$54D64B3C" "COMPADDRES3_"@"SEL$2")
 USE_NL(@"SEL$54D64B3C" "THIS_"@"SEL$2")
 LEADING(@"SEL$54D64B3C" "COMPSHIPME1_"@"SEL$3" "THIS_"@"SEL$2" "COMPADDRES3_"@"SEL$2")
 INDEX_RS_ASC(@"SEL$54D64B3C" "COMPADDRES3_"@"SEL$2" ("TAB2"."COL2"))
 INDEX_RS_ASC(@"SEL$54D64B3C" "THIS_"@"SEL$2" ("TAB1"."COL1" "TAB1"."COL2" "TAB1"."COL3" "TAB1"."COL4"))
 INDEX_RS_ASC(@"SEL$54D64B3C" "COMPSHIPME1_"@"SEL$3" ("TAB3"."COL1"))
 NO_ACCESS(@"SEL$1" "from$_subquery$_001"@"SEL$1")
 OUTLINE(@"SEL$2")
 OUTLINE(@"SEL$3")
 MERGE(@"SEL$2")
 OUTLINE(@"SEL$64EAE176")
 OUTLINE(@"SEL$4")
 OUTLINE_LEAF(@"SEL$1")
 MERGE(@"SEL$64EAE176")
 OUTLINE_LEAF(@"SEL$54D64B3C")
 ALL_ROWS
 OPT_PARAM('_optimizer_extended_cursor_sharing_rel' 'none')
 DB_VERSION('11.2.0.4')
 OPTIMIZER_FEATURES_ENABLE('11.2.0.4')
 IGNORE_OPTIM_EMBEDDED_HINTS
 END_OUTLINE_DATA
 */

Step 3) Use the above outline data to create a profile for the query running on ADG

The SQL profile has to be created on the primary, as we cannot create a SQL profile on ADG (a read-only database).
The previous method, described in "Fixing SQL Plans: The hard way – Part 1", uses the SQL ID as input and queries V$SQLTEXT or the DBA_HIST* tables to get the required SQL text. This SQL text is then supplied to the DBMS_SQLTUNE.IMPORT_SQL_PROFILE procedure.

But since we don't have the SQL text on the primary, we can directly supply the SQL text to the DBMS_SQLTUNE.IMPORT_SQL_PROFILE procedure, as shown below.

Note – If you have single quotes in your SQL text, you need to change each of them to two single quotes; the first single quote acts as an escape character.

In my SQL text, I have a single-quoted literal, so I have to enclose it in two single quotes as shown below:

select * from ( 
       select this_.col1 as col1,
              this_.col2 as col2, 
              this_.col3 as col3, 
              this_.col4 as col4
       from TAB1 this_ 
       inner join TAB2 compaddres3_ on this_.col2 = compaddres3_.col2 
       inner join TAB3 compshipme1_ on this_.col1=compshipme1_.col1
       where this_.col1 in (:1) 
       and   this_.col2 in (:2 , :3) 
       and   compshipme1_.col1 in (:4 )
       and   this_.col8 = ''OPEN''
       and   this_.col5>=:5 
       and   this_.col5<=:6 
       order by this_.col1 asc ) 
where rownum <= :7

The above SQL will be given as input to the DBMS_SQLTUNE.IMPORT_SQL_PROFILE procedure.
Similarly, if we have single quotes in the outline hints that we extracted, we need to change them to two single quotes, as shown below.

Also, we need to enclose every outline hint in single quotes.

'BEGIN_OUTLINE_DATA',
'USE_NL(@"SEL$54D64B3C" "COMPADDRES3_"@"SEL$2")',
'USE_NL(@"SEL$54D64B3C" "THIS_"@"SEL$2")',
'LEADING(@"SEL$54D64B3C" "COMPSHIPME1_"@"SEL$3" "THIS_"@"SEL$2" "COMPADDRES3_"@"SEL$2")',
'INDEX_RS_ASC(@"SEL$54D64B3C" "COMPADDRES3_"@"SEL$2" ("TAB2"."COL2"))',
'INDEX_RS_ASC(@"SEL$54D64B3C" "THIS_"@"SEL$2" ("TAB1"."COL1" "TAB1"."COL2" "TAB1"."COL3" "TAB1"."COL4"))',
'INDEX_RS_ASC(@"SEL$54D64B3C" "COMPSHIPME1_"@"SEL$3" ("TAB3"."COL1"))',
'NO_ACCESS(@"SEL$1" "from$_subquery$_001"@"SEL$1")',
'OUTLINE(@"SEL$2")',
'OUTLINE(@"SEL$3")',
'MERGE(@"SEL$2")',
'OUTLINE(@"SEL$64EAE176")',
'OUTLINE(@"SEL$4")',
'OUTLINE_LEAF(@"SEL$1")',
'MERGE(@"SEL$64EAE176")',
'OUTLINE_LEAF(@"SEL$54D64B3C")',
'ALL_ROWS',
'OPT_PARAM(''_optimizer_extended_cursor_sharing_rel'' ''none'')',
'DB_VERSION(''11.2.0.4'')',
'OPTIMIZER_FEATURES_ENABLE(''11.2.0.4'')',
'IGNORE_OPTIM_EMBEDDED_HINTS',
'END_OUTLINE_DATA'

Note that we have enclosed the hints in single quotes and also put a comma at the end of every hint except the last hint.

So we have required outline hints and SQL text that should be provided as input to DBMS_SQLTUNE.IMPORT_SQL_PROFILE procedure

Step 4) Create the required profile on the primary

Below is the code for creating the SQL profile on the primary with the required inputs.

declare
ar_profile_hints sys.sqlprof_attr;
begin
  ar_profile_hints := sys.sqlprof_attr(
     'BEGIN_OUTLINE_DATA',
     'USE_NL(@"SEL$54D64B3C" "COMPADDRES3_"@"SEL$2")',
     'USE_NL(@"SEL$54D64B3C" "THIS_"@"SEL$2")',
     'LEADING(@"SEL$54D64B3C" "COMPSHIPME1_"@"SEL$3" "THIS_"@"SEL$2" "COMPADDRES3_"@"SEL$2")',
     'INDEX_RS_ASC(@"SEL$54D64B3C" "COMPADDRES3_"@"SEL$2" ("TAB2"."COL2"))',
     'INDEX_RS_ASC(@"SEL$54D64B3C" "THIS_"@"SEL$2" ("TAB1"."COL1" "TAB1"."COL2" "TAB1"."COL3" "TAB1"."COL4"))',
     'INDEX_RS_ASC(@"SEL$54D64B3C" "COMPSHIPME1_"@"SEL$3" ("TAB3"."COL1"))',
     'NO_ACCESS(@"SEL$1" "from$_subquery$_001"@"SEL$1")',
     'OUTLINE(@"SEL$2")',
     'OUTLINE(@"SEL$3")',
     'MERGE(@"SEL$2")',
     'OUTLINE(@"SEL$64EAE176")',
     'OUTLINE(@"SEL$4")',
     'OUTLINE_LEAF(@"SEL$1")',
     'MERGE(@"SEL$64EAE176")',
     'OUTLINE_LEAF(@"SEL$54D64B3C")',
     'ALL_ROWS',
     'OPT_PARAM(''_optimizer_extended_cursor_sharing_rel'' ''none'')',
     'DB_VERSION(''11.2.0.4'')',
     'OPTIMIZER_FEATURES_ENABLE(''11.2.0.4'')',
     'IGNORE_OPTIM_EMBEDDED_HINTS',
     'END_OUTLINE_DATA'
 );
 DBMS_SQLTUNE.IMPORT_SQL_PROFILE(
 sql_text => 'select * from ( 
                   select this_.col1 as col1,
                          this_.col2 as col2, 
                          this_.col3 as col3, 
                          this_.col4 as col4
                   from TAB1 this_ 
                   inner join TAB2 compaddres3_ on this_.col2 = compaddres3_.col2 
                   inner join TAB3 compshipme1_ on this_.col1=compshipme1_.col1
                   where this_.col1 in (:1) 
                   and   this_.col2 in (:2 , :3) 
                   and   compshipme1_.col1 in (:4 )
                   and   this_.col8 = ''OPEN''
                   and   this_.col5>=:5 
                   and   this_.col5<=:6 
                   order by this_.col1 asc ) 
             where rownum <= :7',
 profile => ar_profile_hints,name => 'PROFILE_d7w7fruzwyh2s'
 );
end;
/

You can run the above code on the primary to create the required profile.
Once you create the profile, purge the SQL on ADG using the sys.dbms_shared_pool.purge procedure.
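A sketch of that purge call is below; the address and hash_value placeholders have to be looked up from V$SQLAREA on ADG for the SQL in question (the SQL_ID here is assumed from the profile name used above):

select address, hash_value from v$sqlarea where sql_id = 'd7w7fruzwyh2s';

exec sys.dbms_shared_pool.purge('<address>,<hash_value>', 'C');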

When the SQL runs the next time on ADG, it will automatically pick up the profile created above, and the plan will be the one we tested earlier using hints.

I was able to see the following in the Note section when I ran DBMS_XPLAN.DISPLAY_CURSOR on ADG:

Note
-----
   - SQL profile PROFILE_d7w7fruzwyh2s used for this statement

Hope this helps !!

Reference

https://avdeo.com/2012/06/20/fixing-sql-plans-the-hard-way-part-1/

https://avdeo.com/2012/07/04/fixing-sql-plans-the-hard-way-part-2/

https://avdeo.com/2012/07/13/fixing-sql-plans-the-hard-way-part-3/


Using UDev to configure ASM disks

This is a small article on using udev on RHEL 7. I have a VirtualBox VM with RHEL 7 on which I am configuring ASM diskgroups.

Before we configure ASM diskgroups, we need disks/partitions available at the OS level that can be recognized by ASM. Oracle provides a utility called ASMLib which can be installed and used very easily to configure disks/partitions at the OS level.

ASMLib stamps the header of partitions/disks at the OS level with an Oracle format, making them easily detected by an ASM instance doing disk discovery. You can go through ASMLib installation, configuration and usage in an article written by my friend Anand Prakash: https://aprakash.wordpress.com/2016/05/19/oracle-asmlib

In this article we will focus on using UDev for setting up ASM disks.

Let’s start with some information on UDev

What is Udev?

Udev is the device manager for the Linux 2.6 kernel that creates/removes device nodes in the /dev directory dynamically. It is the successor of devfs and hotplug. It runs in userspace and the user can change device names using Udev rules.

Why Do We Need It ?

In the older kernels, the /dev directory contained static device files. But with dynamic device creation, device nodes are created only for those devices which are actually present in the system. Let us see the disadvantages of the static /dev directory, which led to the development of Udev.

Problems Identifying the Exact Hardware Device for a Device Node in /dev

The kernel will assign a major/minor number pair when it detects a hardware device while booting the system. Let us consider two hard disks: the connection is such that one is connected as master and the other as slave. The Linux system will call them /dev/hda and /dev/hdb. Now, if we interchange the disks, the device names will change. This makes it difficult to identify the correct device related to the available static device node, and the situation gets worse when there are a bunch of hard disks connected to the system.

Udev provides a persistent device naming system through the /dev directory, making it easier to identify the device.

Huge Number of Device Nodes in /dev

In the static model of device node creation, no method was available to identify the hardware devices actually present in the system. So, device nodes were created for all the devices that Linux was known to support at the time. The huge mess of device nodes in /dev made it difficult to identify the devices actually present in the system.

Not Enough Major/Minor Number Pairs

The number of static device nodes to be included increased a lot over time, and the 8-bit scheme that was used proved insufficient for handling all the devices. As a result, the major/minor number pairs started running out.

Working of Udev

The Udev daemon listens to the netlink socket that the kernel uses for communicating with user space applications. The kernel will send a bunch of data through the netlink socket when a device is added to, or removed from a system. The Udev daemon catches all this data and will do the rest, i.e., device node creation, module loading etc.

Kernel Device Event Management

  • When bootup is initialized, the /dev directory is mounted in tmpfs.
  • After that, Udev will copy the static device nodes from /lib/udev/devices to the /dev directory.
  • The Udev daemon then runs and collects uevents from the kernel, for all the devices connected to the system.
  • The Udev daemon will parse the uevent data and it will match the data with the rules specified in /etc/udev/rules.d.
  • It will create the device nodes and symbolic links for the devices as specified in the rules.
  • The Udev daemon reads the rules from /etc/udev/rules.d/*.rules and stores them in memory.
  • Udev will receive an inotify event if any rules are changed. It will read the changes and update its memory.

Let's start by creating disks at the OS level. VirtualBox has commands which create block devices at the OS level.

Using Udev to configure disks

Step 1) Creating Block devices at OS level

I am creating 4 disks, each of 10GB, using the following commands:

VBoxManage createhd --filename asm1.vdi --size 10240 --format VDI --variant Fixed
VBoxManage createhd --filename asm2.vdi --size 10240 --format VDI --variant Fixed
VBoxManage createhd --filename asm3.vdi --size 10240 --format VDI --variant Fixed
VBoxManage createhd --filename asm4.vdi --size 10240 --format VDI --variant Fixed

Step 2) Attaching the disk to the correct storage

VBoxManage storageattach 12102 --storagectl "SATA" --port 1 --device 0 --type hdd --medium asm1.vdi --mtype shareable
VBoxManage storageattach 12102 --storagectl "SATA" --port 2 --device 0 --type hdd --medium asm2.vdi --mtype shareable
VBoxManage storageattach 12102 --storagectl "SATA" --port 3 --device 0 --type hdd --medium asm3.vdi --mtype shareable
VBoxManage storageattach 12102 --storagectl "SATA" --port 4 --device 0 --type hdd --medium asm4.vdi --mtype shareable

12102 is the name of the virtual machine, as shown in the screenshot below.

[Screenshot: VirtualBox manager showing the virtual machine name]

Step 3) (optional) Making disk sharable

This is an optional step and is required only if you are using an ASM cluster. In that case the ASM disks should be made shareable, as they have to be attached to all machines in the cluster. You can use the following commands to make the disks shareable:

VBoxManage modifyhd asm1.vdi --type shareable
VBoxManage modifyhd asm2.vdi --type shareable
VBoxManage modifyhd asm3.vdi --type shareable
VBoxManage modifyhd asm4.vdi --type shareable

Step 4) Check if the disks are now shown in your virtual box

The following screenshot shows that the disks are now attached to the correct virtual machine.

[Screenshot: VirtualBox manager showing the attached disks]

Also, after you log in to the VM, you should be able to see the new devices under the /dev directory.

[root@advait ~]# ls -l /dev/sd*
brw-rw---- 1 root disk 8, 0 Jul 18 05:36 /dev/sda
brw-rw---- 1 root disk 8, 1 Jul 18 05:36 /dev/sda1
brw-rw---- 1 root disk 8, 2 Jul 18 05:36 /dev/sda2
brw-rw---- 1 root disk 8, 16 Jul 18 05:36 /dev/sdb
brw-rw---- 1 root disk 8, 32 Jul 18 05:36 /dev/sdc
brw-rw---- 1 root disk 8, 48 Jul 18 05:36 /dev/sdd
brw-rw---- 1 root disk 8, 64 Jul 18 05:36 /dev/sde

/dev/sda, /dev/sda1 and /dev/sda2 are the main device used by the VM itself and its partitions.
/dev/sdb, /dev/sdc, /dev/sdd and /dev/sde are the devices we added above.

Step 5) Format the new devices and create partitions

You need to create new partitions using the fdisk utility. I am showing the command for one disk; you can repeat the same for the remaining disks.

[root@advait ~]# fdisk /dev/sdb
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0xbda01838.

Command (m for help): n
Partition type:
 p primary (0 primary, 0 extended, 4 free)
 e extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-20971519, default 2048): 
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-20971519, default 20971519): 
Using default value 20971519
Partition 1 of type Linux and of size 10 GiB is set

Command (m for help): p

Disk /dev/sdb: 10.7 GB, 10737418240 bytes, 20971520 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xbda01838

Device     Boot     Start        End    Blocks   Id System
/dev/sdb1            2048   20971519  10484736   83 Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

This will create a single partition on each of the devices.

[root@advait ~]# ls -rlt /dev/sd*1
brw-rw---- 1 root disk 8, 1 Jul 18 05:36 /dev/sda1
brw-rw---- 1 root disk 8, 17 Jul 18 05:40 /dev/sdb1
brw-rw---- 1 root disk 8, 33 Jul 18 05:40 /dev/sdc1
brw-rw---- 1 root disk 8, 49 Jul 18 05:41 /dev/sdd1
brw-rw---- 1 root disk 8, 65 Jul 18 05:41 /dev/sde1

Step 6) Configure UDev rules

Before creating a udev rule, we need to understand what exactly we want to do. We want to create an alias for each of the disks we created at the OS level, so that each disk is always identified the same way, regardless of the device name Linux assigns it. This can be done by recognizing each device based on some unique ID and assigning an alias to that device. Udev can do just that.

Each disk has a unique SCSI ID. We can use this unique ID to recognize the disk and assign the required alias to it.
We can get the unique SCSI ID using the following command:

[root@advait ~]# /usr/lib/udev/scsi_id -g -u -d /dev/sdb1
1ATA_VBOX_HARDDISK_VB544d069c-abd3901e

The scsi_id command is located in the /usr/lib/udev/ directory on RHEL 7; in previous releases it used to be in /sbin.

Likewise, we can find the SCSI ID for each disk that we added.

Rules are defined in the /etc/udev/rules.d directory. Udev reads these rules and applies them to the devices listed in /dev.

A rule looks like the following (each rule must be on a single line in the rules file):

KERNEL=="sd?1", SUBSYSTEM=="block", -
PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", -
RESULT=="1ATA_VBOX_HARDDISK_VB88ee7178-f28aa887", -
SYMLINK+="asm-disk1", OWNER="oracle", GROUP="dba", MODE="0660"

Following is the explanation of each parameter

  • KERNEL=="sd?1" – This matches the kernel name of the device. In our case all our partitions have names matching sd?1 (sdb1, sdc1 etc.), so this key matches the kernel name of the devices.
  • SUBSYSTEM=="block" – This key matches the subsystem of the devices. SUBSYSTEM could be block, scsi, usb etc. We have all block devices.
  • PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent" – This gets the unique SCSI ID for the device matched by the first 2 keys (KERNEL and SUBSYSTEM).
  • RESULT=="1ATA_VBOX_HARDDISK_VB544d069c-abd3901e" – This matches the output of the PROGRAM command against RESULT. If the result matches, the actions below are taken.
  • SYMLINK+="asm-disk1" – This is an action key. If the PROGRAM output matches RESULT, a symlink named asm-disk1 is created, pointing to the device in question.
  • OWNER="oracle" – This action changes the ownership of the device to the oracle user.
  • GROUP="dba" – This action changes the group of the device to the dba group.
  • MODE="0660" – This changes the permissions of the device file to 0660.

So the above rule means that the partition "sd?1" on the disk with the SCSI ID "1ATA_VBOX_HARDDISK_VB88ee7178-f28aa887" will always be reachable via the symlink /dev/asm-disk1, regardless of the letter Linux assigns when the device is discovered. In addition, the device will have the correct ownership and permissions for ASM.

We can create such a rule for each of the devices, or if the number of devices is large, we can use wildcards and matching patterns to create more intelligent rules.
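
For example, since all four disks here are VirtualBox disks, a single generic rule could derive the symlink name from the device serial number instead of hard-coding one rule per disk. This is only a sketch, assuming the ID_SERIAL/ID_SERIAL_SHORT properties are populated by the earlier persistent-storage rules (they are visible in the udevadm test output in step 7) and that the serial prefix is unique to these disks:

KERNEL=="sd?1", SUBSYSTEM=="block", ENV{ID_SERIAL}=="VBOX_HARDDISK_VB*", SYMLINK+="asm-$env{ID_SERIAL_SHORT}", OWNER="oracle", GROUP="dba", MODE="0660"

The downside is that the symlink names come from the serial numbers rather than the friendly asm-disk1..asm-disk4 names, so per-device rules as below are often preferred.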

I have created the following rules. Create a new rule file /etc/udev/rules.d/99-oracle-asm-disks.rules and put the following rules in it, one for each device

KERNEL=="sd?1", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", RESULT=="1ATA_VBOX_HARDDISK_VB544d069c-abd3901e", SYMLINK+="asm-disk1", OWNER="oracle", GROUP="dba", MODE="0660"
KERNEL=="sd?1", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", RESULT=="1ATA_VBOX_HARDDISK_VB9a630cc5-d9697727", SYMLINK+="asm-disk2", OWNER="oracle", GROUP="dba", MODE="0660"
KERNEL=="sd?1", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", RESULT=="1ATA_VBOX_HARDDISK_VB446fdf92-8640efff", SYMLINK+="asm-disk3", OWNER="oracle", GROUP="dba", MODE="0660"
KERNEL=="sd?1", SUBSYSTEM=="block", PROGRAM=="/usr/lib/udev/scsi_id -g -u -d /dev/$parent", RESULT=="1ATA_VBOX_HARDDISK_VB3a71b4f2-8c603b78", SYMLINK+="asm-disk4", OWNER="oracle", GROUP="dba", MODE="0660"

Step 6) Load updated block device partition tables

You can use partprobe to make the kernel re-read the partition tables of the block devices

/sbin/partprobe /dev/sdb1
/sbin/partprobe /dev/sdc1
/sbin/partprobe /dev/sdd1
/sbin/partprobe /dev/sde1

Step 7) Test the rules

This is an optional step to check if the rules are working as expected. You can run the following commands to test the rules

udevadm test /block/sdb/sdb1
udevadm test /block/sdc/sdc1
udevadm test /block/sdd/sdd1
udevadm test /block/sde/sde1

The output of one of the above commands looks like the following

calling: test
version 219
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

=== trie on-disk ===
tool version: 219
file size: 6984832 bytes
header size 80 bytes
strings 1805856 bytes
nodes 5178896 bytes
Load module index
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
Reading rules file: /usr/lib/udev/rules.d/10-dm.rules
Reading rules file: /usr/lib/udev/rules.d/100-balloon.rules
Reading rules file: /usr/lib/udev/rules.d/13-dm-disk.rules
Reading rules file: /usr/lib/udev/rules.d/40-redhat.rules
Reading rules file: /usr/lib/udev/rules.d/42-usb-hid-pm.rules
Reading rules file: /usr/lib/udev/rules.d/50-udev-default.rules
Reading rules file: /usr/lib/udev/rules.d/60-alias-kmsg.rules
Reading rules file: /usr/lib/udev/rules.d/60-cdrom_id.rules
Reading rules file: /usr/lib/udev/rules.d/60-drm.rules
Reading rules file: /usr/lib/udev/rules.d/60-keyboard.rules
Reading rules file: /usr/lib/udev/rules.d/60-net.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-alsa.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-input.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-serial.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-storage-tape.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-storage.rules
Reading rules file: /usr/lib/udev/rules.d/60-persistent-v4l.rules
Reading rules file: /usr/lib/udev/rules.d/60-raw.rules
Reading rules file: /usr/lib/udev/rules.d/61-accelerometer.rules
Reading rules file: /usr/lib/udev/rules.d/64-btrfs.rules
Reading rules file: /usr/lib/udev/rules.d/70-mouse.rules
Reading rules file: /usr/lib/udev/rules.d/70-power-switch.rules
Reading rules file: /usr/lib/udev/rules.d/70-touchpad.rules
Reading rules file: /usr/lib/udev/rules.d/70-uaccess.rules
Reading rules file: /usr/lib/udev/rules.d/71-biosdevname.rules
Reading rules file: /usr/lib/udev/rules.d/71-seat.rules
Reading rules file: /usr/lib/udev/rules.d/73-idrac.rules
Reading rules file: /usr/lib/udev/rules.d/73-seat-late.rules
Reading rules file: /usr/lib/udev/rules.d/75-net-description.rules
Reading rules file: /usr/lib/udev/rules.d/75-probe_mtd.rules
Reading rules file: /usr/lib/udev/rules.d/75-tty-description.rules
Reading rules file: /usr/lib/udev/rules.d/78-sound-card.rules
Reading rules file: /usr/lib/udev/rules.d/80-drivers.rules
Reading rules file: /usr/lib/udev/rules.d/80-net-name-slot.rules
Reading rules file: /usr/lib/udev/rules.d/80-net-setup-link.rules
Reading rules file: /usr/lib/udev/rules.d/81-kvm-rhel.rules
Reading rules file: /usr/lib/udev/rules.d/85-nm-unmanaged.rules
Reading rules file: /usr/lib/udev/rules.d/90-alsa-tools-firmware.rules
Reading rules file: /usr/lib/udev/rules.d/90-iprutils.rules
Reading rules file: /usr/lib/udev/rules.d/90-vconsole.rules
Reading rules file: /usr/lib/udev/rules.d/91-drm-modeset.rules
Reading rules file: /usr/lib/udev/rules.d/95-dm-notify.rules
Reading rules file: /usr/lib/udev/rules.d/95-udev-late.rules
Reading rules file: /usr/lib/udev/rules.d/98-kexec.rules
Reading rules file: /usr/lib/udev/rules.d/98-rdma.rules
Reading rules file: /etc/udev/rules.d/99-oracle-asm-disks.rules
Reading rules file: /usr/lib/udev/rules.d/99-systemd.rules
rules contain 24576 bytes tokens (2048 * 12 bytes), 12216 bytes strings
1803 strings (22584 bytes), 1184 de-duplicated (10988 bytes), 620 trie nodes used
GROUP 6 /usr/lib/udev/rules.d/50-udev-default.rules:52
LINK 'disk/by-id/ata-VBOX_HARDDISK_VB544d069c-abd3901e-part1' /usr/lib/udev/rules.d/60-persistent-storage.rules:43
IMPORT builtin 'blkid' /usr/lib/udev/rules.d/60-persistent-storage.rules:72
probe /dev/sdb1 raid offset=0
PROGRAM '/usr/lib/udev/scsi_id -g -u -d /dev/sdb' /etc/udev/rules.d/99-oracle-asm-disks.rules:1
starting '/usr/lib/udev/scsi_id -g -u -d /dev/sdb'
'/usr/lib/udev/scsi_id -g -u -d /dev/sdb'(out) '1ATA_VBOX_HARDDISK_VB544d069c-abd3901e'
'/usr/lib/udev/scsi_id -g -u -d /dev/sdb' [10122] exit with return code 0
OWNER 54321 /etc/udev/rules.d/99-oracle-asm-disks.rules:1
GROUP 54322 /etc/udev/rules.d/99-oracle-asm-disks.rules:1
MODE 0660 /etc/udev/rules.d/99-oracle-asm-disks.rules:1
LINK 'asm-disk1' /etc/udev/rules.d/99-oracle-asm-disks.rules:1
PROGRAM '/usr/lib/udev/scsi_id -g -u -d /dev/sdb' /etc/udev/rules.d/99-oracle-asm-disks.rules:2
starting '/usr/lib/udev/scsi_id -g -u -d /dev/sdb'
'/usr/lib/udev/scsi_id -g -u -d /dev/sdb'(out) '1ATA_VBOX_HARDDISK_VB544d069c-abd3901e'
'/usr/lib/udev/scsi_id -g -u -d /dev/sdb' [10123] exit with return code 0
PROGRAM '/usr/lib/udev/scsi_id -g -u -d /dev/sdb' /etc/udev/rules.d/99-oracle-asm-disks.rules:3
starting '/usr/lib/udev/scsi_id -g -u -d /dev/sdb'
'/usr/lib/udev/scsi_id -g -u -d /dev/sdb'(out) '1ATA_VBOX_HARDDISK_VB544d069c-abd3901e'
'/usr/lib/udev/scsi_id -g -u -d /dev/sdb' [10124] exit with return code 0
PROGRAM '/usr/lib/udev/scsi_id -g -u -d /dev/sdb' /etc/udev/rules.d/99-oracle-asm-disks.rules:4
starting '/usr/lib/udev/scsi_id -g -u -d /dev/sdb'
'/usr/lib/udev/scsi_id -g -u -d /dev/sdb'(out) '1ATA_VBOX_HARDDISK_VB544d069c-abd3901e'
'/usr/lib/udev/scsi_id -g -u -d /dev/sdb' [10125] exit with return code 0
handling device node '/dev/sdb1', devnum=b8:17, mode=0660, uid=54321, gid=54322
preserve permissions /dev/sdb1, 060660, uid=54321, gid=54322
preserve already existing symlink '/dev/block/8:17' to '../sdb1'
found 'b8:17' claiming '/run/udev/links/\x2fasm-disk1'
creating link '/dev/asm-disk1' to '/dev/sdb1'
preserve already existing symlink '/dev/asm-disk1' to 'sdb1'
found 'b8:17' claiming '/run/udev/links/\x2fdisk\x2fby-id\x2fata-VBOX_HARDDISK_VB544d069c-abd3901e-part1'
creating link '/dev/disk/by-id/ata-VBOX_HARDDISK_VB544d069c-abd3901e-part1' to '/dev/sdb1'
preserve already existing symlink '/dev/disk/by-id/ata-VBOX_HARDDISK_VB544d069c-abd3901e-part1' to '../../sdb1'
created db file '/run/udev/data/b8:17' for '/block/sdb/sdb1'
ACTION=add
DEVLINKS=/dev/asm-disk1 /dev/disk/by-id/ata-VBOX_HARDDISK_VB544d069c-abd3901e-part1
DEVNAME=/dev/sdb1
DEVPATH=/block/sdb/sdb1
DEVTYPE=partition
ID_ATA=1
ID_ATA_FEATURE_SET_PM=1
ID_ATA_FEATURE_SET_PM_ENABLED=1
ID_ATA_SATA=1
ID_ATA_SATA_SIGNAL_RATE_GEN2=1
ID_ATA_WRITE_CACHE=1
ID_ATA_WRITE_CACHE_ENABLED=1
ID_BUS=ata
ID_MODEL=VBOX_HARDDISK
ID_MODEL_ENC=VBOX\x20HARDDISK\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
ID_PART_ENTRY_DISK=8:16
ID_PART_ENTRY_NUMBER=1
ID_PART_ENTRY_OFFSET=2048
ID_PART_ENTRY_SCHEME=dos
ID_PART_ENTRY_SIZE=20969472
ID_PART_ENTRY_TYPE=0x83
ID_PART_TABLE_TYPE=dos
ID_REVISION=1.0
ID_SERIAL=VBOX_HARDDISK_VB544d069c-abd3901e
ID_SERIAL_SHORT=VB544d069c-abd3901e
ID_TYPE=disk
MAJOR=8
MINOR=17
SUBSYSTEM=block
TAGS=:systemd:
USEC_INITIALIZED=404668
Unload module index
Unloaded link configuration context.

Step 8) Reload rules of udev

Use the following command to reload the rules

udevadm control --reload-rules
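
Note that reloading only makes udev re-read the rule files; it does not re-apply them to devices that already exist. To re-run the rules against the existing block devices, we can re-trigger the udev events. A sketch, assuming the RHEL 7 udevadm syntax:

udevadm trigger --type=devices --action=change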

Check whether the required symlinks were created and the other actions were performed on the devices

[root@advait ~]# ls -rlt /dev/asm*
lrwxrwxrwx 1 root root 4 Jul 18 06:20 /dev/asm-disk2 -> sdc1
lrwxrwxrwx 1 root root 4 Jul 18 06:20 /dev/asm-disk3 -> sdd1
lrwxrwxrwx 1 root root 4 Jul 18 06:20 /dev/asm-disk4 -> sde1
lrwxrwxrwx 1 root root 4 Jul 18 06:20 /dev/asm-disk1 -> sdb1

The symlinks above are owned by root, but the devices themselves are owned by oracle:dba

[root@advait ~]# ls -rlt /dev/sd?1
brw-rw---- 1 root disk 8, 1 Jul 18 05:36 /dev/sda1
brw-rw---- 1 oracle dba 8, 33 Jul 18 06:20 /dev/sdc1
brw-rw---- 1 oracle dba 8, 49 Jul 18 06:20 /dev/sdd1
brw-rw---- 1 oracle dba 8, 65 Jul 18 06:20 /dev/sde1
brw-rw---- 1 oracle dba 8, 17 Jul 18 06:20 /dev/sdb1

Now ASM can identify the disks as /dev/asm-disk* and these symlinks will persist with host reboot.

Hope this helps !!

References:

https://www.linux.com/news/udev-introduction-device-management-modern-linux-system
http://www.reactivated.net/writing_udev_rules.html

Previewing Backup Restore

Introduction:

This is a short article on RMAN showing how we can check whether our backups are really intact and can help us in a critical situation when we have to restore and recover the database.

We have a command option called PREVIEW which we can use with RESTORE DATABASE. This option does not actually restore the datafiles from the backup; it just reports which backups would be used, the SCN from which media recovery must start, and the SCN beyond which we must recover to bring the datafiles out of fuzzy status.

Lets take a simple example where I have a complete database backup taken for my database including archivelogs.

I will run RESTORE DATABASE PREVIEW command to check if I can get back my database in case of any issues. Then I will wipe out my database and try to recover from the backup to validate if PREVIEW option was showing correct information.

Validating Backupsets

We can validate that our backupsets are intact and do not have any corruption. We can check the backupsets for both physical and logical corruption and make sure they are good and can be used for restore.

The following RMAN commands confirm the same

Physical corruption validation

RMAN> backup validate database archivelog all;

Starting backup at 08-FEB-16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=272 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=13 device type=DISK
channel ORA_DISK_1: starting compressed full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
...
...
...

Logical corruption validation


RMAN> backup validate check logical database archivelog all;

Starting backup at 08-FEB-16
using channel ORA_DISK_1
using channel ORA_DISK_2
channel ORA_DISK_1: starting compressed full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00008 name=/u01/app/oracle/oradata/orcl/datafiles/users02.dbf
...
...
...

If the above commands don't report any issues, we can be sure that the backupsets are intact and can be used for restore/recovery.
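
As an additional check, RMAN can also read the exact backup pieces a restore would use, without writing any datafiles, using RESTORE ... VALIDATE. A minimal sketch:

RMAN> restore database validate;

This differs from BACKUP VALIDATE above: BACKUP VALIDATE reads the live database files, whereas RESTORE VALIDATE reads the backups themselves.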

Checking PREVIEW

We can use the “RESTORE DATABASE PREVIEW” command in RMAN to get a preview of all the backup sets that would be used for a restore.

This command also gives us the SCN beyond which we need to recover the DB to get all datafiles out of fuzzy status


RMAN> restore database preview;

Starting restore at 08-FEB-16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=58 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=264 device type=DISK
List of Backup Sets
===================
BS Key Type LV Size Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
48 Full 209.01M DISK 00:00:43 08-FEB-16
BP Key: 48 Status: AVAILABLE Compressed: YES Tag: TAG20160208T053652
Piece Name: /u01/app/oracle/fast_recovery_area/ORCL/backupset/2016_02_08/o1_mf_nnndf_TAG20160208T053652_ccjvy4y2_.bkp
List of Datafiles in backup set 48
File LV Type Ckp SCN Ckp Time Name
---- -- ---- ---------- --------- ----
1 Full 2955419 08-FEB-16 /u01/app/oracle/oradata/orcl/datafiles/system01.dbf
4 Full 2955419 08-FEB-16 /u01/app/oracle/oradata/orcl/datafiles/undotbs01.dbf
8 Full 2955419 08-FEB-16 /u01/app/oracle/oradata/orcl/datafiles/users02.dbf

BS Key Type LV Size Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
47 Full 152.79M DISK 00:00:41 08-FEB-16
BP Key: 47 Status: AVAILABLE Compressed: YES Tag: TAG20160208T053652
Piece Name: /u01/app/oracle/fast_recovery_area/ORCL/backupset/2016_02_08/o1_mf_nnndf_TAG20160208T053652_ccjvy4yx_.bkp
List of Datafiles in backup set 47
File LV Type Ckp SCN Ckp Time Name
---- -- ---- ---------- --------- ----
3 Full 2955421 08-FEB-16 /u01/app/oracle/oradata/orcl/datafiles/sysaux01.dbf
6 Full 2955421 08-FEB-16 /u01/app/oracle/oradata/orcl/datafiles/users01.dbf
9 Full 2955421 08-FEB-16 /u01/app/oracle/oradata/orcl/datafiles/users03.dbf

...
...
...
List of Backup Sets
===================
BS Key Size Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ ---------------
54 3.97M DISK 00:00:01 08-FEB-16
BP Key: 54 Status: AVAILABLE Compressed: YES Tag: TAG20160208T055120
Piece Name: /u01/app/oracle/fast_recovery_area/ORCL/backupset/2016_02_08/o1_mf_annnn_TAG20160208T055120_ccjws8m9_.bkp

List of Archived Logs in backup set 54
Thrd Seq Low SCN Low Time Next SCN Next Time
---- ------- ---------- --------- ---------- ---------
1 43 2955401 08-FEB-16 2956061 08-FEB-16

BS Key Size Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ ---------------
55 80.50K DISK 00:00:00 08-FEB-16
BP Key: 55 Status: AVAILABLE Compressed: YES Tag: TAG20160208T055120
Piece Name: /u01/app/oracle/fast_recovery_area/ORCL/backupset/2016_02_08/o1_mf_annnn_TAG20160208T055120_ccjws9xm_.bkp

List of Archived Logs in backup set 55
Thrd Seq Low SCN Low Time Next SCN Next Time
---- ------- ---------- --------- ---------- ---------
1 44 2956061 08-FEB-16 2956142 08-FEB-16
1 45 2956142 08-FEB-16 2956308 08-FEB-16
RMAN-05119: recovery can not be done to a consistent state.
Media recovery start SCN is 2955419
Recovery must be done beyond SCN 2955603 to clear datafile fuzziness
Finished restore at 08-FEB-16

If you check the last 4 lines, they show the SCN from which media recovery must start – SCN 2955419 – and the SCN beyond which recovery must proceed to clear datafile fuzziness – SCN 2955603.

From SCN 2955419 we need to start applying archive logs. We can identify the log sequence in which this SCN falls using the SQL below


SQL> select sequence#, first_change#, next_change# from v$archived_log
  2  where FIRST_CHANGE# <= 2955419 and NEXT_CHANGE# >= 2955419;

SEQUENCE#  FIRST_CHANGE# NEXT_CHANGE#
---------- ------------- ------------
43         2955401       2956061

So we will need archives from sequence# 43 and we need to recover beyond SCN 2955603.
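
If you only need the one-line-per-backupset listing and the recovery SCNs, without the full per-datafile detail, PREVIEW also accepts a SUMMARY modifier. A minimal sketch:

RMAN> restore database preview summary;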

Trying restore/recover

Lets say a few (or all) datafiles of the database are missing. I tried to bounce the DB and got the following error


SQL> startup
ORACLE instance started.

Total System Global Area 943718400 bytes
Fixed Size 2931136 bytes
Variable Size 348128832 bytes
Database Buffers 587202560 bytes
Redo Buffers 5455872 bytes
Database mounted.
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/datafiles/system01.dbf'

Lets try to restore


[oracle@advait ~]$ rman target /

Recovery Manager: Release 12.1.0.2.0 - Production on Mon Feb 8 06:56:26 2016

Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved.

connected to target database: ORCL (DBID=1429382412, not open)

RMAN> restore database;

Starting restore at 08-FEB-16
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=248 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=13 device type=DISK

skipping datafile 5; already restored to file /u01/app/oracle/oradata/orcl/pdbseed/system01.dbf
skipping datafile 7; already restored to file /u01/app/oracle/oradata/orcl/pdbseed/sysaux01.dbf
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00003 to /u01/app/oracle/oradata/orcl/datafiles/sysaux01.dbf
channel ORA_DISK_1: restoring datafile 00006 to /u01/app/oracle/oradata/orcl/datafiles/users01.dbf
channel ORA_DISK_1: restoring datafile 00009 to /u01/app/oracle/oradata/orcl/datafiles/users03.dbf
...
...
...
channel ORA_DISK_2: restored backup piece 1
channel ORA_DISK_2: restore complete, elapsed time: 00:01:01
channel ORA_DISK_1: piece handle=/u01/app/oracle/fast_recovery_area/ORCL/2B0D04DCEFAF55F5E0531438A8C0EDC4/backupset/2016_02_08/o1_mf_nnndf_TAG20160208T053652_ccjvzlmt_.bkp tag=TAG20160208T053652
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:35
Finished restore at 08-FEB-16

RMAN>

Once the restore completes, lets try to recover


SQL> recover database;
ORA-00279: change 2955419 generated at 02/08/2016 05:36:52 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/fast_recovery_area/ORCL/archivelog/2016_02_08/o1_mf_1_43_%u_.arc
ORA-00280: change 2955419 for thread 1 is in sequence #43
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}

As expected, it's asking for recovery starting from sequence# 43.

We can restore the required archivelogs as well


RMAN> restore archivelog from sequence 43;

Starting restore at 08-FEB-16
using channel ORA_DISK_1
using channel ORA_DISK_2

channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=43
channel ORA_DISK_1: reading from backup piece /u01/app/oracle/fast_recovery_area/ORCL/backupset/2016_02_08/o1_mf_annnn_TAG20160208T055120_ccjws8m9_.bkp
...
...
...
channel ORA_DISK_2: restored backup piece 1
channel ORA_DISK_2: restore complete, elapsed time: 00:00:01
Finished restore at 08-FEB-16

RMAN>

Once the archives are restored, we can carry on with the recovery


SQL> recover database
ORA-00279: change 2955419 generated at 02/08/2016 05:36:52 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/fast_recovery_area/ORCL/archivelog/2016_02_08/o1_mf_1_43_cck2fwv
k_.arc
ORA-00280: change 2955419 for thread 1 is in sequence #43
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/u01/app/oracle/fast_recovery_area/ORCL/archivelog/2016_02_08/o1_mf_1_43_cck2fwvk_.arc
Log applied.
Media recovery complete.
SQL> alter database open;

Database altered.

SQL>

In this case it only took sequence# 43, as the NEXT_CHANGE# for sequence# 43 was 2956061, which is beyond the SCN 2955603 required to clear the fuzzy state. The rest of the redo was also present in the online redo logs, so it was applied as part of the recovery and the database opened cleanly.
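
Before opening, you can also confirm from the datafile headers that nothing is fuzzy any more and that the checkpoint SCN has moved beyond 2955603. A quick sketch:

SQL> select file#, fuzzy, checkpoint_change# from v$datafile_header;

All datafiles should show FUZZY = NO and a checkpoint SCN greater than 2955603 before the open succeeds.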

Hope this helps !!

Library Cache: Mutex X – Bug 20879889 – Fixed in 11.2.0.4

I recently encountered a bug related to MView log causing very high library cache: mutex x wait events.

I will briefly describe the debugging steps I tried and the fix for the same.

A few things to note before I proceed:

  • We observed huge waits on library cache: mutex X whenever we flipped to standby or when the DB was bounced, i.e. when the library cache was cold and didn't have the required cursor information and object handles.
  • Load on the database was very high. This included a very high number of sessions connected to the database and a high number of DMLs performed on the same table by many sessions.
  • The table on which the DMLs were performed had an MLOG (materialized view log) created on it, based on the primary key. This is required as the changes need to flow to downstream databases via fast-refresh MViews. Around 10 downstream sites were registered on this MLOG.

Root Cause Analysis:

After looking at the wait events, we immediately started digging into the root cause. We wanted to understand the bottleneck causing these wait events and remove whatever it took to reduce them.

I started by checking the top wait events in the last 20 mins from v$active_session_history


SQL>select * from (select event, count(1) from v$active_session_history
2 where sample_time > (sysdate - 20/1440) group by event order by 2 desc) where rownum < 10;

EVENT                                    COUNT(1)
---------------------------------------- ----------
library cache: mutex X                        50943
checkpoint completed                          15170
read by other session                          5487
row cache lock                                 4205
log file sync                                  3137
flashback buf free by RVWR                     1815
db file sequential read                        1675
log file switch completion                     1611
cursor: pin S wait on X                        1516

9 rows selected.

Clearly library cache: mutex X was way higher.

We can check what is causing library cache: mutex X by checking P1 of that wait event


SQL>select event, p1, count(1) from v$active_session_history where sample_time > (sysdate - 20/1440) and event = 'library cache: mutex X' group by event, p1 order by 3;

EVENT                                    P1         COUNT(1)
---------------------------------------- ---------- ----------
library cache: mutex X                    421399181          1
library cache: mutex X                   3842104349          1
library cache: mutex X                   3477606468          1
library cache: mutex X                   2432877226          1
library cache: mutex X                    955484820          2
library cache: mutex X                        37005         25
library cache: mutex X                   1412465886        297
library cache: mutex X                   2417922189      50615

8 rows selected.

As you can see, most of the waits are caused by P1 = 2417922189

P1 is the idn – it can be used to find the cursor related to the mutex

For library cache: mutex X

  1. P1 represents library cache hash bucket number (if idn <= 131072)
  2. P1 represents hash value of the library cache object under protection (if idn > 131072)

In our case hash value was 2417922189. So it represents library cache object.

We can check if this hash value belongs to a cursor (SQL) using v$sql


SQL>select sql_id from v$sql where hash_value = 2417922189;

SQL_ID
-------------
3b7aa6f81x44d

SQL>@sql
Enter SQL_ID:- 3b7aa6f81x44d
old 3: where sql_id = '&SQL_ID'
new 3: where sql_id = '3b7aa6f81x44d'

INSERT /*+ IDX(0) */ INTO "BOOKER"."MLOG$_FULFILLMENT_DEMAND_I"
(dmltype$$,old_new$$,snaptime$$,change_vector$$,xid$$,"WAREHOUSE
_ID","CO_WAREHOUSE_ITEM_ID") VALUES (:d,:o,to_date('4000-01-01:0
0:00:00','YYYY-MM-DD:HH24:MI:SS'),:c,:x,:1,:2)

As you can see, this is the DML on the MLOG table. So clearly the MLOG seems to be the bottleneck.
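
To see who was holding the mutex at the time of the waits, we can also look at the mutex sleep history. A sketch, assuming v$mutex_sleep_history is populated on your version:

SQL>select mutex_type, location, blocking_session, requesting_session
  2 from v$mutex_sleep_history where mutex_identifier = 2417922189;

BLOCKING_SESSION gives the SID that was holding the mutex when each sleep was recorded.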

Immediate Action:

The wait events were making everything stand still and nothing was getting processed. Since database availability was critical and we couldn't afford to lose any more time, our immediate action was to drop the MLOG.

But since the DB had thousands of sessions, all stuck on library cache: mutex X, we could not even get a lock to drop the MLOG (not even with ddl_lock_timeout). Killing sessions was not helping as they kept coming back in no time.

So we decided to take a quick bounce of the DB (keeping the listeners down), dropped this MLOG and started the listeners again.

This has fixed the issue and there were no wait events.

On the down side, all downstream sites had to do a complete refresh of the MView followed by a fast refresh, as they got deregistered from the MLOG. But that was OK as the master table was not very big.

Patch Fix:

We also tried to investigate why an MLOG can cause this issue, but were not able to get much insight.

An MLOG is a table into which the primary keys of changed records in the main table are inserted; when a downstream table refreshes the data, these records get deleted. So we don't expect the MLOG to grow very large. Unfortunately, in this case it was 7.7 GB (bigger than the master table itself).
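
The size of the MLOG and the downstream sites registered against it are easy to check. A sketch, using the segment name from this incident as the example:

SQL>select segment_name, round(bytes/1024/1024/1024,2) size_gb
  2 from dba_segments where owner = 'BOOKER' and segment_name = 'MLOG$_FULFILLMENT_DEMAND_I';

SQL>select owner, name, mview_site from dba_registered_mviews;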

Because the size of the MLOG was so high, whenever a downstream database did a fast refresh it had to update the SNAPTIME$$ column in the MLOG table, which tells the MLOG which rows were refreshed by which downstream DB at what time. This update might take more time and hold enqueue locks. But inserts should not be blocked by this update, as an insert adds a new record and does not collide at the enqueue level. Also, we were seeing these library cache wait events just after a bounce or failover.

This is identified as a bug and Oracle has provided following patch as a fix for this issue

Patch 20879889: INSERT INTO MV LOG LEAVING TOO MANY OPEN CURSORS AFTER UPGR TO 11.2.0.4

Make sure you have this patch applied if you are using MLOG in your database.

Hope this helps !!

Reference:

https://support.oracle.com/epmos/faces/PatchDetail?requestId=19179589&_afrLoop=14055698364071&patchId=20879889&_afrWindowMode=0&_adf.ctrl-state=xqyu7jtft_77

Mutex: What do we know ?

In my previous article on Latches, I mentioned various things related to latches including different types and their behavior.

In this article I will describe similar things about mutexes.

Mutexes are a low-level serialization/locking mechanism that protects memory structures inside the library cache.

Why there was a change from Latches ? 

Mutexes were introduced from 10.2 onwards and have proved very efficient in managing library cache operations.

A mutex takes less memory than a latch. Typically a latch structure takes around 110 bytes on a 32-bit system whereas a mutex takes only 30 bytes. This difference shows up in the number of instructions as well: a latch takes around 150-200 instructions to acquire whereas a mutex takes 30-35 instructions. The down side is that a mutex gives less information about the waits and blockers compared to latches.

As we know from the basics of library cache architecture, it consists of hash buckets and each bucket contains a linked list of library cache object handles. Whenever a library cache object is accessed, it is hashed to a bucket and the corresponding latch must be obtained to traverse the chain in that bucket. The latch is released when the corresponding object is found (or not found).

But there are only 67 library cache latches available to protect the 131,072 buckets created by default in the library cache, so a single latch covers multiple buckets. This creates false contention: if 2 processes try to access 2 different buckets protected by the same latch, one of them has to wait until the other completes traversing its bucket. So even though they do not collide on the same bucket, they still block each other because of the latch mechanism.

Further, if the required object is found in the library cache, the process needs to pin that object while it is being used. The pin protects the object so that no other process can modify it and so that it will not be discarded from memory while you are accessing it. When you are done, you need to take a latch again to unpin the object. These latches to pin and unpin need memory allocation and deallocation to happen, which is expensive.

Compared to the above process, mutexes have many benefits

As against 67 latches protecting 131,072 buckets, Oracle introduced 131,072 mutexes, one to protect each bucket. So false contention is reduced drastically. False contention can still occur if 2 processes want to access 2 different objects which belong to the same bucket, but that is very rare.

After a mutex is obtained, the process traverses the chain until it finds the required object. Once an object is found, the process needs to pin the object to access it. But with mutexes there is no need for a “library cache pin” latch; instead the mutex itself acts as the pin. A mutex acts as the serialization mechanism for traversing a linked list as well as for the cursor pin structure. A mutex pin can be referenced by multiple sessions, provided that all sessions reference the mutex in shared mode only. The total number of sessions referencing a mutex in shared (S) mode is called the reference count (ref count), and it is stored in the mutex itself. Whenever a session wants to pin an object in the shared pool, it increments the ref count of the mutex associated with that object. This is much faster and needs no memory allocation or deallocation. Once a process is done with the object, it decrements the ref count. An object is not discarded from memory until its ref count is zero.

Why not to increase number of latches ?

a. Latches need more memory, and it is allocated upfront. As against mutexes, which are dynamic and created when requested, latches are created when the instance starts and their memory locations are externalized in the views v$latch and v$latch_children.

b. The more latches you have, the more work you may have to do in some form of maintenance, reporting, or garbage collection. This increases the processing demand on the server.

Now that we know why Oracle introduced mutex, lets move further in understanding more.

Shared, Exclusive and Examine mode:

Shared mode: Multiple sessions can access the memory structure protected by a mutex in shared mode, meaning every session has read access to the structure protected by the mutex. Every time a session accesses or acquires a mutex in shared mode, the ref count of the mutex needs to be updated in the mutex structure. The number of sessions holding the mutex can be seen in the lower bytes of the P3 value of the mutex wait event (cursor: pin S).

Exclusive mode: This mode is incompatible with all other modes. Only 1 session can hold the lock in exclusive mode. In this mode, the upper bytes of the P3 value of the mutex wait event represent the SID of the holding session.

Examine mode: This mode indicates that the mutex is in transition from shared mode to exclusive mode. During this time no other session can access or modify the mutex. In this mode, the upper bytes of the P3 value represent the SID of the holding session and the lower bytes represent the number of sessions holding the mutex in shared mode.

Mutex Acquisition:

So how does mutex acquisition work ?

We saw in the past article on latches how latch acquisition works. Mutex acquisition is similar in that it tries an immediate get and, if it cannot acquire the mutex, it spins followed by a sleep (though the sleep depends on the Oracle version).

Over the past years and releases of Oracle, the mutex acquisition algorithm has changed drastically, and I would say it has become more stable with newer releases. In Oracle 10.2, mutexes were used only for pinning objects in the library cache (and only when _kks_use_mutex_pin=true) and there was no sleep.

Below algorithm applies to other mutexes which were introduced in 11g.

Oracle 10.2 – 11.1

Following is the mutex acquisition algorithm in 10.2 – 11.1

- Immediate gets mutex
      - spin gets mutex
            - yield()

A session will try to acquire the mutex in immediate mode; if the mutex is not available, it will spin. If the mutex is still not available it will yield the CPU, meaning the process is placed at the end of the run queue but it is still considered “ON CPU”. Because of this, CPU consumption was very high. Also, because there was no sleep, the wait interface could not record the actual time waited for mutex acquisition. In AWR reports, in the top 5 timed events, we could see a high number of waits, but the total time waited used to be very low and CPU time used to be close to DB time.

If CPU resources are not available, the OS will de-schedule such a process (one that is spinning and yielding), and only at that time does the actual wait event for cursor: pin S get recorded. But if a system has a lot of CPU (as many systems do), the process will not even get de-scheduled, and Oracle thinks it is not waiting at all.

Therefore, in Oracle 10.2-11.1 the “cursor: pin S” wait time is purely a wait for CPU.

For systems which did not have enough spare CPU, “cursor: pin S” used to be among the top wait events, because processes got de-scheduled from the run queue, and the spinning and yielding made CPU consumption skyrocket. These are the systems where the mutex issue came to light.

Because of these issues, Oracle released patch 6904068: High CPU usage when there are “cursor: pin S” waits.

This patch was released for 11.1 and later backported to 10.2. With this patch, Oracle introduced the underscore parameter _first_spare_parameter, with a default value of 1 (centisecond). This parameter provides a sleep time for the mutex if it is not acquired after the first spin. So mutex behavior became similar to latch behavior, but a wait timeout of 10 ms was too high.

On the better side, CPU consumption on the system decreased, and because of the sleep time this wait event (cursor: pin S) showed up correctly in the top 5 timed events (the total time waited was quite accurate because of the sleep). The parameter _first_spare_parameter was dynamic, and if set to 0, mutex acquisition behaved as before, without sleep (and with the aggravated CPU usage).

But with all the above discussion, a question arises – if cursor: pin S is a shared mutex which can be acquired by multiple sessions concurrently, why would there be blocking or wait events ?

Well, there are a couple of scenarios where we can see this wait event

  1. If a mutex is in the transition state “E” as mentioned above
  2. Every time a session acquires a mutex in shared mode, it has to update the ref count of that mutex. That can cause contention.

What happened in 11.2 ?

Version 11.2.0.1

The behavior of mutexes in 11.2.0.1 is the same as in 10.2: aggressive, with no sleep. But the event became visible as a top event in spite of there being no sleep calls.

This happened because Oracle counted the wait time as the time between the first spin and successful mutex acquisition. So all the yields after the first spin were counted in the waits. The total time waited was therefore shown as high in AWR reports and it appeared as one of the top timed events (even without sleep).

Version 11.2.0.2.2 PSU

This version had a drastic change in mutex behavior. Oracle came up with an exhaustive mechanism for controlling mutexes called mutex wait schemes.

Oracle introduced several underscore parameters to control the spin, yield and sleep behavior of mutexes.

The following underscore parameters were introduced starting with this version

_mutex_wait_scheme – this parameter was introduced with 3 different values for 3 different types of mutex wait configurations.

0 - Always yield
1 - Always sleep for _mutex_wait_time
2 - Exponential back off with max sleep time of _mutex_wait_time

_mutex_spin_count – Number of times the process should spin. Default 255
_mutex_wait_time – amount of time process should sleep. Default 1ms

Along with this parameter, Oracle introduced following parameter to control yield and sleep of mutex

_wait_yield_mode – this defines if the process should yield first or sleep first. possible values are “yield” (default) or “sleep”

_wait_yield_hp_mode – defines high priority processes; default is SMON and VKTM

_wait_yield_sleep_time_msecs – defines how much a process should sleep before it yields again (in millisec). Default 1

_wait_yield_sleep_freq – number of yield cycles to do before it goes to sleep. default 100

_wait_yield_yield_freq – number of sleep cycles to do before it yields. default 20
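
All of these are hidden parameters, so their current values are visible only through the x$ views (connected as SYS). A standard sketch:

SQL>select p.ksppinm name, v.ksppstvl value
  2 from x$ksppi p, x$ksppcv v
  3 where p.indx = v.indx
  4 and (p.ksppinm like '\_mutex%' escape '\' or p.ksppinm like '\_wait\_yield%' escape '\');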

_mutex_wait_scheme=0

I think in this scheme _mutex_wait_time is not applicable, because the sleep time depends on _wait_yield_sleep_freq. But I am not very sure.

With _mutex_wait_scheme=0 and default values for the _wait_yield* parameters above, mutex acquisition works the following way

- Immediate mutex gets
      - spin - _mutex_spin_count times
            - yield() - 100 times
                  - sleep - 20 times (20 cycles of 1 ms each)
            - yield() - 100 times
                  - sleep - 20 times (20 cycles of 1 ms each)

Using the above underscore parameters we can vary the behaviour. For example, if we change “_wait_yield_mode” to “sleep” instead of “yield”, the Oracle process will sleep first

- Immediate mutex gets
      - spin 
            - sleep - 20 times (20 cycles of 1 ms each)
      - yield() - 100 times
            - sleep - 20 times (20 cycles of 1 ms each)
      - yield() - 100 times

_mutex_wait_scheme=1

In this scheme _mutex_wait_time comes into the picture. This is the “always sleep” mode: the process goes to sleep after the first spin, and wakes up after a timeout of _mutex_wait_time each time.

So behavior will look like following

- Immediate mutex gets
      - Spin for _mutex_spin_count
            - sleep for _mutex_wait_time
            - sleep for _mutex_wait_time
            - sleep for _mutex_wait_time
            - sleep for _mutex_wait_time
            ...
            ...

_mutex_wait_scheme=2

In this scheme _mutex_wait_time defines the max time a process should wait. This is “exponential backoff” scheme. Wait time increases exponentially until it reaches a max value set by _mutex_wait_time.

Other than this, the behavior differs in the initial spin and yield cycle: it spins and yields 2 times initially before the sleep begins.

So behavior will look like following

- Immediate mutex gets
      - Spin for _mutex_spin_count - 2 times
            - yield() - 2 times
                  - sleep - with exponential increase of timeouts

Starting from 11.2.0.2.2, _mutex_wait_scheme=2 is the default behavior

Hope this helps !!

References:

http://blog.tanelpoder.com/2008/08/03/library-cache-latches-gone-in-oracle-11g/

http://oracleinaction.com/latche-lock-pin-mutex/

https://andreynikolaev.wordpress.com

Latches: What do we know ?

Latches are a low-level serialization mechanism which protects memory areas inside the SGA. They are lightweight and less sophisticated than enqueues, and can be acquired and released very quickly.

Latch acquisition does not involve any complex algorithm and is based on test-and-set atomic instruction of a computer processor.

Latch Classification:

Latches can be classified in multiple ways:

Shared latch and exclusive latch:

A shared latch is one which can be shared by multiple processes/sessions.

For example, if a session wants to read a block in memory and at the same time another session also wants to read the same block, both can acquire the shared latch. An example of a shared latch is “latch: cache buffers chains”. This latch was exclusive in older versions and Oracle changed it to a shared latch from Oracle 9i onwards.

An exclusive latch is one which is acquired when a session wants to modify a block or memory area. An exclusive latch can be held by only 1 session/process at a time and is not compatible with any other latch.

Shared and exclusive latches are not compatible with each other. A process holding a shared latch will block other processes needing the exclusive latch, but will allow other processes needing a shared latch on the same memory area. Similarly, a process holding an exclusive latch will not allow any other process needing either a shared or an exclusive latch.

An example of exclusive latches is the “library cache latches” of previous versions. These were taken in exclusive mode even for traversing a hash bucket. These latches are no longer present in 11g; they have been replaced by mutexes.

As per Andrey Nikolaev, 460 out of 551 latches are exclusive in Oracle 11.2.0.2. In each new version, Oracle tries to make more latches sharable in order to reduce contention.

Another classification of latches is immediate latches and willing-to-wait latches.

Immediate latch and willing to wait latch:

Immediate latches are the ones a session tries to acquire immediately, without waiting. If the latch is not available, the session will not wait and may get terminated or check for another latch. An example of an immediate latch is the redo copy latch. This is an immediate latch because Oracle only wants to know whether anyone else is currently copying redo data to the log buffer, not who exactly is copying and where, as this does not matter to LGWR.

Willing-to-wait latches are the ones which will wait if the latch is not available. These latches have slightly more complex behavior than immediate latches. Most of the latches in Oracle are willing-to-wait latches.

Why do we see wait events on shared latches ?

After reading about shared latches and exclusive latches, one question comes to mind regarding shared latches.

If the “latch: cache buffers chains” latch is a shared latch (since 9i) and shared latches don't block each other, why do we still see “latch: cache buffers chains” wait events even in the latest versions of Oracle ?

The answer to this is explained by Andrey Nikolaev, who has given a very practical explanation of why this happens. When multiple processes are acquiring a shared latch (for example the CBC latch), we should not see any wait events or processes blocking each other. But the moment another session tries to acquire the latch in exclusive mode, exclusive requests get higher preference than shared ones. So the session with the exclusive request gets access to the resource and blocks all the sessions wanting the shared latch, and those sessions form a queue. The strange part of this algorithm is that even after the exclusive latch has been released by the process holding it, the other processes cannot acquire the shared latch concurrently. They remain in the latch queue and get the shared latch one by one: the first process in the queue gets the shared latch, and once it is done with its access, it releases the latch and posts the next process that the latch is available. This is one of the major reasons why we see wait events on shared latches.

Process flow for latch acquisition

Following is the process flow for acquiring a latch

 - Immediate latch gets 
       - spin latch gets
             - Add process to queue of latch waits
                   - Sleep until posted

The last step in the process flow (sleep until posted) is a changed behavior. Until 9i, Oracle used to wake up periodically to check whether the latch had been freed and, if the latch was still not available, go back to sleep. So the process flow in 9i used to be the following

 - Immediate latch gets 
       - spin latch gets
             - Add process to queue of latch waits
                   - Sleep for fixed time and wake up
             - immediate latch gets
                   - Sleep for fixed time and wake up
             - immediate latch gets
                   - Sleep for fixed time and wake up

This behavior has been changed from 10g onwards: the holding process wakes up the waiting process after the latch becomes free. This has been explained well by Tanel Poder, and the behavior is explained in more detail by Alex Fatkulin.

Sometimes the holding process was not able to wake up the waiting process, because of bugs or lower kernel versions. So older versions of Oracle used to have a default timeout, so that if the holding process missed the wakeup call, the waiter would not wait indefinitely and would wake up after the timeout. Oracle introduced a parameter _enable_reliable_latch_waits={true|false} which alters this behavior: if set to true, no timeout is added and the waiting process continues to sleep until it gets posted by the holding process. False gives the behavior with the timeout.

The only latch which is an exception to the above process flow is the “process allocation” latch. This latch does not depend on the holding process to post it when the latch is free; it wakes up periodically to try to acquire the latch.

Latch Behavior

Oracle has introduced a few parameters to control latch behavior. We will discuss these parameters briefly and how they affect shared and exclusive latches.

When a process tries to acquire a latch, it first attempts to acquire it in immediate mode. If the latch is not available in immediate mode, it spins a defined number of times and tries again. If the latch is still not free, the process goes to sleep.

Process Spin:

Spinning is the act of consuming/burning CPU so that the process stays “ON CPU” and does not get de-scheduled. While spinning, the process burns CPU for a few microseconds in the hope that after that much time the latch will be available to acquire. This does increase CPU consumption, but it saves time as it may avoid the need for the process to sleep (which is expensive, as it involves context switches).

The number of times a process spins to acquire a latch depends on the type of latch.

Exclusive latch:

For exclusive latches, this “can be” controlled by _spin_count but needs a database bounce. I said “can be” because exclusive latch spins are actually decided by the “spin” column in the x$ksllclass table.


SQL>select indx,spin from x$ksllclass;

      INDX       SPIN
---------- ----------
         0      20000
         1      20000
         2      20000
         3      20000
         4      20000
         5      20000
         6      20000
         7      20000

8 rows selected.

There are 8 classes of latches (as indicated by the indx column); we will discuss them later in this article. By default all latches belong to class 0.
If we want to change the spin count for exclusive latches, we need to change the value of the SPIN column for class 0. This can be done by changing _spin_count and bouncing the instance (but that will change the spin count for all classes), or by setting the _latch_class_0 parameter (which will change the spin count for only class 0). We have similar parameters to change the spin count for the other classes (_latch_class_[0-7]).

So changing _spin_count is not a good idea. Instead we can move the specific latch for which we want to change the spin count to another class (1-7) and change the corresponding underscore parameter (_latch_class_[1-7]), as sketched below.
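
For example, to give one latch its own spin count we can map its latch# (from v$latchname) to class 1 and then size class 1. This is only a sketch: the latch# below is hypothetical, both parameters need an instance bounce, and the exact value format of _latch_class_1 can vary by version:

SQL>select latch#, name from v$latchname where name = 'cache buffers chains';
SQL>alter system set "_latch_classes" = "98:1" scope=spfile;
SQL>alter system set "_latch_class_1" = "10000" scope=spfile;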

By default the _spin_count parameter is not applicable to exclusive latches, as its default value is 2000 whereas exclusive latches spin 20000 times, as shown in x$ksllclass above.
But explicitly changing _spin_count makes it applicable to exclusive latches as well.

Shared Latch:

For shared latches, the number of times a process spins is _spin_count * 2. This has been proved by Andrey Nikolaev. Also, the _spin_count parameter is applicable to shared latches by default. So since the default value of _spin_count is 2000, shared latches spin 4000 times (2000 * 2).

Diagnosing latch contention:

I am not going to mention much here because Tanel Poder has already written great scripts – latchprof.sql and latchprofx.sql – which can be used to analyze latch wait events.

Tanel has also written a great article on how to diagnose latch wait events – http://tech.e2sn.com/oracle/troubleshooting/latch-contention-troubleshooting

In the next article, I will try to cover mutexes.

Hope this helps !!

Reference:

https://andreynikolaev.wordpress.com

http://tech.e2sn.com/oracle/troubleshooting/latch-contention-troubleshooting

Brief about Workload Management in Oracle RAC

This is a brief article about workload management in RAC. I have tried to cover the different components of workload management in RAC and how they are configured on the client side or the server side. I have not gone into the details of the configuration steps, but just mention briefly how each can be done.

Readers are advised to refer Oracle documentation to understand details about configuration of workload management.

Workload management on RAC

There are 2 major components of workload management:

  1. Failover – if connection to one instance fails, Oracle should failover the connection to another instance
  2. Load Balancing – Workload on RAC instances should be distributed equally

Failover can be connect time or run time. Similarly load balancing can be achieved during connect time or run time.

We can also configure these components either on client side or on server side.

I have tried to put workload management in RAC into a single block diagram to give a high-level overview. The following figure summarizes configuring workload management in RAC.

[Figure: workload management in RAC – block diagram]

Lets check how to configure each of these components.

Failover

We can achieve failover at connect time or at run time. Depending on when we want the failover to happen, we can configure it on the client side or on the server side.

Connect time failover, Client side

Connect time failover is configured on the client side, since the failover happens when an instance is down before the client can even connect. Logically it is not possible to have it on the server side (that would require the connection to complete, in which case it would not be connect time failover).

This is achieved on client side using FAILOVER=ON parameter in TNS string.
Example:

ORCL=
     (DESCRIPTION=
         (ADDRESS_LIST=
         (FAILOVER=ON)
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node1-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node2-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node3-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node4-vip) (PORT=1521))
     )
    (CONNECT_DATA= (SERVICE_NAME= ORCL))
)

Run time failover, Client side

At run time, we can achieve failover using Transparent Application Failover (TAF).
TAF can be configured on client side in TNS string using FAILOVER_MODE parameter.
Example:

ORCL=
     (DESCRIPTION=
         (ADDRESS_LIST=
         (FAILOVER=ON)
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node1-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node2-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node3-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node4-vip) (PORT=1521))
     )
     (CONNECT_DATA= (SERVICE_NAME=ORCL)
     (FAILOVER_MODE=(TYPE=select)(METHOD=basic))
     )
)


If you check the above TNS string, we have the FAILOVER_MODE parameter, which specifies the failover type and method. If FAILOVER_MODE is specified, then in case of an instance outage, existing connected sessions will automatically fail over at run time to the surviving instances. There is more to TAF than covered here; you can check the Oracle documentation or the reference links in this article for complete details.

Run time failover, Server side

The same TAF configuration can be done on the server side as well. This is done as part of service management in RAC.
We can use SRVCTL to configure services on RAC and add the TAF parameters.

[oracle@orcl_node1 ~]$ srvctl add service -d orcl -s test -r orcl1 -P BASIC -e SESSION -m BASIC
[oracle@orcl_node1 ~]$ srvctl start service -d orcl -s test
[oracle@orcl_node1 ~]$ srvctl status service -d orcl -s test
Service test is running on instance(s) orcl1
[oracle@orcl_node1 ~]$ srvctl config service -d orcl -s test
Service name: test
Service is enabled
Server pool: orcl_test
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SESSION
Failover method: BASIC
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: 
Runtime Load Balancing Goal:
TAF policy specification: BASIC
Preferred instances: orcl1
Available instances:
[oracle@orcl_node1 ~]$

We can also use Fast Connection Failover (FCF) with Fast Application Notification (FAN) events in OCI clients to get notifications about instance availability. Based on these notifications, clients can reconnect to the surviving instances.
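
FAN notifications are enabled per service. A sketch using the same srvctl syntax as the examples above, where -q toggles the “AQ HA notifications” attribute shown in the srvctl config output:

[oracle@orcl_node1 ~]$ srvctl modify service -d orcl -s test -q TRUE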

Load Balancing

We can achieve load balancing at connect time or at run time. Depending upon when we want to achieve load balancing (connect time or run time), we can configure load balancing on client side or on server side.

Connect time load balancing, Client side

We can achieve connect time load balancing on client side using LOAD_BALANCE=ON parameter in TNS string.

Example:

ORCL=
     (DESCRIPTION=
         (ADDRESS_LIST=
         (LOAD_BALANCE=ON)
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node1-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node2-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node3-vip) (PORT=1521))
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_node4-vip) (PORT=1521))
     )
    (CONNECT_DATA= (SERVICE_NAME= ORCL))
)

The LOAD_BALANCE parameter is set to ON by default and we do not have to specify it explicitly; putting LOAD_BALANCE=OFF will disable load balancing. Oracle picks random hosts from the address list and tries to balance connections across the database instances. With 11.2, Oracle introduced the SCAN listener, which provides host IPs in round-robin fashion. So with a single SCAN alias in the TNS entry, connections are balanced across hosts automatically, which largely obsoletes the LOAD_BALANCE parameter in TNS. Example:

ORCL=
     (DESCRIPTION=
         (LOAD_BALANCE=ON)
         (ADDRESS= (PROTOCOL=TCP) (HOST=orcl_scan) (PORT=1521))
         (CONNECT_DATA= (SERVICE_NAME= ORCL))
     )

Connect time load balancing, Server side

We can enable server side load balancing using the CLB_GOAL service attribute.
Oracle introduced the load balancing advisory in 10g, which keeps track of the load on individual instances. Dynamic registration keeps all listeners aware of the load profile of each instance. We need to set remote_listener to a TNS alias containing the virtual IP addresses of all nodes in the cluster. Even with a SCAN listener, we need to keep the SCAN VIP in the remote_listener parameter.

CLB_GOAL stands for connect time load balancing goal. It is used to define the expected session duration for the service. For example, if we have an OLTP service and we expect lots of short sessions which last for a very short time (a few seconds to a few minutes), we can set CLB_GOAL for that service to SHORT. If the service is expected to serve sessions which stay connected for a longer duration (minutes to hours), we can set CLB_GOAL to LONG. Setting CLB_GOAL instructs the listener to route connections based on metrics: load per node (based on CPU run queue) for CLB_GOAL=SHORT, or the number of current connections for CLB_GOAL=LONG.

  • If CLB_GOAL is short then Oracle considers load per node (based on CPU run queue) as metrics and route connection to host where load is less.
  • If CLB_GOAL is long then Oracle considers number of connections to instance as metrics and route connection to host where number of connections are less.

Example:

[oracle@orcl_node1 ~]$ srvctl modify service -d orcl -s test -j SHORT
[oracle@orcl_node1 ~]$ srvctl config service -d orcl -s test
Service name: test
Service is enabled
Server pool: orcl_test
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SESSION
Failover method: BASIC
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: SHORT
Runtime Load Balancing Goal: NONE
TAF policy specification: BASIC
Preferred instances: orcl1
Available instances:
[oracle@orcl_node1 ~]$

Run time load balancing, Server side

We cannot have client side load balancing at run time, because run time load balancing balances each transaction, as against each connection.
The load balancing advisory serves as the basis for runtime connection load balancing. Using dynamic service registration, services are registered with all listeners, and PMON of each instance updates the load profile to all listeners. Since the listeners know the load profile of all instances, sessions are directed to the most appropriate instance depending on the runtime load balancing goal. Connection allocation is based on the current performance level of the database instances as indicated by load balancing advisory FAN events. This provides load balancing at the transaction level instead of only at the time of the initial connection.

The service level of instances is analyzed based on the runtime load balancing goal:

  • Service time (internet web processing)
  • Throughput (batch processing)

The above runtime load balancing goals can be set using the GOAL parameter of the service (don't confuse this with the CLB_GOAL parameter, which is for connect time load balancing).

We can set this parameter GOAL on server side for each service using SRVCTL.

Once we set this parameter to either GOAL_SERVICE_TIME or GOAL_THROUGHPUT, Oracle will balance the load using the following metrics

  • If GOAL_SERVICE_TIME is used, Oracle checks the service time, i.e. how fast an instance is serving a single transaction. Oracle maintains this metric for each instance, and connections for the service are directed to the instance with the best service time. This is mainly for OLTP workloads.
  • If GOAL_THROUGHPUT is used, Oracle checks the throughput metric, i.e. which instance is completing the most work in the least time, and directs connections to the instance with the best throughput. This is mainly for batch processing.

Example:

[oracle@orcl_node1 ~]$ srvctl modify service -d orcl -s test -B SERVICE_TIME
[oracle@orcl_node1 ~]$ srvctl config service -d orcl -s test
Service name: test
Service is enabled
Server pool: orcl_test
Cardinality: 1
Disconnect: false
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Failover type: SESSION
Failover method: BASIC
TAF failover retries: 0
TAF failover delay: 0
Connection Load Balancing Goal: SHORT
Runtime Load Balancing Goal: SERVICE_TIME
TAF policy specification: BASIC
Preferred instances: orcl1
Available instances:

Reference:

http://www.oracle.com/technetwork/database/features/oci/taf-10-133239.pdf

http://docs.oracle.com/cd/B19306_01/rac.102/b14197/hafeats.htm

How many checkpoints in Oracle database ?

This question came to my mind when I was discussing “direct path reads” with DBA candidates. I was surprised that many DBAs were not aware of the object level checkpoint that happens during a direct path read. So I figured many DBAs may not know the different levels of checkpoints that happen in the database, hence this question.

Well, the answer is 7 (as far as I know). Oracle does 7 different checkpoints at various stages. Let's check out what those checkpoints are and at what point each of them happens.

  1. Full checkpoint
  2. Thread checkpoint
  3. File level checkpoint
  4. Object level checkpoint
  5. Parallel query checkpoint (BTW, this is the checkpoint that happens during direct path reads)
  6. Incremental checkpoint
  7. Checkpoint during log switch

Full Checkpoint

This happens when a DBA explicitly issues the checkpoint command “alter system checkpoint”. When this happens, all dirty blocks in the buffer cache (of all instances, in case of RAC) are written to disk. This includes committed as well as uncommitted data blocks.

This checkpoint also happens when a DBA shuts down the database cleanly using “shutdown immediate” or puts the entire database in backup mode using “alter database begin backup”.
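
For reference, these are the statements mentioned above that trigger a full checkpoint:

SQL> alter system checkpoint;

System altered.

SQL> alter database begin backup;

Database altered.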

Thread Checkpoint

A thread checkpoint is basically a full checkpoint in a single instance database. A redo thread is associated with an instance, so a thread checkpoint writes all dirty blocks of a single thread, i.e. a single instance, to the datafiles. In RAC, when a checkpoint is done for one specific instance, it is called a thread checkpoint. This is done using “alter system checkpoint local”.
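
For reference, the thread checkpoint variant of the command is:

SQL> alter system checkpoint local;

System altered.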

File level Checkpoint

A file level checkpoint writes the dirty blocks associated with the set of datafiles belonging to a tablespace. This happens when we put a tablespace in backup mode, take a tablespace offline, or make a tablespace read only. Oracle writes all dirty blocks associated with the datafiles of that tablespace to disk before it changes the status of the tablespace.
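
For illustration, each of the following statements (shown against a hypothetical USERS tablespace) forces a file level checkpoint before the tablespace status changes:

SQL> alter tablespace users begin backup;
SQL> alter tablespace users offline normal;
SQL> alter tablespace users read only;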

Object level Checkpoint

All the dirty blocks that belong to a specific object are written to the datafiles by the DBWR process. This happens when you perform one of the following actions on the object (for example a table or an index):

  • Drop table
  • truncate table
  • drop index
  • drop table purge

You might be wondering why Oracle has to write an object's blocks to the datafiles if we are dropping it anyway. There are 2 reasons:

  1. Oracle writes these blocks to the datafiles “before” performing the above DDL tasks.
  2. It is required for recovery purposes. If you ever have to restore and recover the database, Oracle needs the object's previous blocks so that it can roll forward and roll back.

Parallel query checkpoint

Whenever you run queries using parallel workers, Oracle does direct path reads and reads data from the datafiles directly into the PGA, bypassing the SGA. Starting from 11g, direct path reads also happen for serial full table scans where the table size is larger than the _small_table_threshold parameter.
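
If you want to check the threshold on your system, the usual sketch is the query below (run as SYS; x$ views are internal and can change between versions, and the value is in buffers, not bytes):

SQL> select a.ksppinm name, b.ksppstvl value
  2  from x$ksppi a, x$ksppcv b
  3  where a.indx = b.indx
  4  and a.ksppinm = '_small_table_threshold';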

Imagine a session that connected earlier, did some DML on a table in the SGA (buffer cache), and committed the changes. A checkpoint doesn't happen as soon as we commit, so what happens when another session connects immediately and does a full table scan or parallel scan of that table? How will it see the latest data?

This is where the parallel query checkpoint comes in. When you run a full table scan or parallel query scan, you will see the “direct path read” wait event, but at the beginning you will also see the “enq: KO - fast object checkpoint” wait event. This checkpoints any dirty blocks belonging to the object you are reading via direct path, so that the latest changes reach the datafiles before the read starts.
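
As a quick sketch (reusing the T1 table from earlier; your wait counts will differ), you can force a parallel scan and then look for both wait events:

SQL> select /*+ parallel(t, 4) */ count(*) from T1 t;

SQL> select event, total_waits
  2  from v$system_event
  3  where event in ('direct path read', 'enq: KO - fast object checkpoint');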

Incremental Checkpoint

Prior to Oracle 8i, Oracle used to checkpoint only during a log switch and at no point before that. So during a log file switch Oracle had to write a lot of blocks to disk, and we would see a sudden spike in IO. It also increased recovery time: if the database crashed near the end of a log file, no checkpoint had happened since the previous switch, so all of that redo had to be applied during recovery.

Starting with 8i, Oracle began doing incremental checkpoints from time to time. This logic has evolved since Oracle 8i, and different parameters control this behaviour in the latest versions compared to the older ones.

In earlier releases, we had the log_checkpoint_interval and log_checkpoint_timeout parameters, which controlled how often incremental checkpoints should happen in order to meet recovery SLAs. In later releases, Oracle provided the fast_start_mttr_target and fast_start_io_target parameters. These parameters take our SLA timings, and Oracle internally decides how frequently it has to take incremental checkpoints.
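
A minimal sketch, assuming a 5 minute recovery SLA (pick a value that matches your own requirement), and then checking what Oracle estimates against that target:

SQL> alter system set fast_start_mttr_target=300 scope=both;

System altered.

SQL> select target_mttr, estimated_mttr from v$instance_recovery;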

There is another situation where Oracle has to go for an incremental checkpoint even though the above parameters (fast_start_mttr_target or fast_start_io_target) have not yet triggered one. This happens when Oracle is not able to find any free blocks in the buffer cache. At this point Oracle has to flush the least recently used blocks to the datafiles in order to make room for new buffers coming in. By default, Oracle scans _db_block_max_scan_pct percent of the buffers before it decides to flush LRU blocks to the datafiles. If it is not able to find the required number of free blocks even after scanning _db_block_max_scan_pct percent of the buffers, it goes for an incremental checkpoint, starting with the LRU blocks.

Checkpoint logic is much more complex than what I explained here. But the objective of this article was just to introduce the different types of checkpoints, so I am not covering the details of the checkpoint algorithm.

Checkpoint during log switch

This is the most obvious checkpoint. It happens whenever a log switch happens. But note that during a log switch, only the dirty blocks whose redo is protected by that log file are written to the datafiles. So not all dirty blocks are written during a log switch.
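
For illustration, you can trigger this checkpoint manually and check the redo log status afterwards (group numbers and sequences will differ on your system):

SQL> alter system switch logfile;

System altered.

SQL> select group#, thread#, sequence#, status from v$log;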

Hope this helps !!

Reference

https://bdrouvot.wordpress.com/2015/04/30/direct-path-read-and-enq-ko-fast-object-checkpoint/