August 2013 – Persistent Storage Solutions

One of the parameters we configure in a physical standby setup controls how long LGWR on the primary waits for the physical standby to respond.

When changes happen on the primary, the resulting redo is shipped to the physical standby database. If the standby database is down or the standby server is unreachable, we need a limit on how long the primary waits for the standby to respond before it moves ahead without trying to ship redo to the standby. This limit is defined by the Net Timeout parameter.

You can check the definition in the Oracle docs: http://docs.oracle.com/cd/E11882_01/server.112/e17023/dbpropref.htm#i101032

"The NetTimeout configurable database property specifies the number of seconds the LGWR waits for Oracle Net Services to respond to a LGWR request. It is used to bypass the long connection timeout in TCP."

One issue I ran into was my DG broker reporting the following error:

Dataguard Configuration...
  Protection Mode: MaxAvailability
  Databases:
    orcl_b - Primary database
      Error: ORA-16825: multiple errors or warnings, including fast-start failover-related errors or warnings, detected for the database
    orcl_a - (*) Physical standby database
      Warning: ORA-16817: unsynchronized fast-start failover configuration
  (*) Fast-Start Failover target

When I checked the database info in verbose mode, I saw the following:

DGMGRL> show database verbose orcl_a

Database - orcl_a

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   1 minute 1 second
  Apply Lag:       3 minutes 7 seconds
  Real Time Query: OFF

This means that even though my DB is in MaxAvailability mode, I still see lag, and the standby is not getting in sync with the primary.

My broker log file (drc<ORACLE_SID>.log in the diagnostic_dest location) showed the following error:

08/03/2013 07:51:44
Redo transport problem detected: redo transport for database orcl_a has the following error:
  ORA-16198: Timeout incurred on internal channel during remote archival
Data Guard Broker Status Summary:
  Type                        Name                             Severity  Status
  Configuration               FSF                               Warning  ORA-16607
  Primary Database            orcl_b                              Error  ORA-16825
  Physical Standby Database   orcl_a                            Warning  ORA-16817

Oracle error ORA-16198 indicates a timeout while the primary was contacting the standby site.

When I sanity-checked the standby, everything was fine. So I checked the Net Timeout parameter, which defines how long the primary waits when contacting the standby.

I realized that the timeout value was very low on my system.

When you run show database verbose <unique name>, it also lists the database properties, including:

NetTimeout                      = '4'
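
If you only want this one value rather than the full verbose listing, DGMGRL can also show a single property by name. A minimal sketch, assuming the same standby name (orcl_a) used above:

```
DGMGRL> show database orcl_a NetTimeout;
```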

In my case it was set to 4 seconds, which is a very low value.
As soon as I set it to around 10, everything was back to normal.
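
For reference, the change can be made from DGMGRL with the edit database command. This is a sketch that assumes the property is changed on orcl_a, the standby whose properties are shown above; adjust the name and value for your own configuration:

```
DGMGRL> edit database orcl_a set property NetTimeout = 10;
DGMGRL> show database orcl_a NetTimeout;
```

The second command simply re-reads the property so you can confirm the new value took effect.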

There is no standard value for this parameter, but a usual value is between 10 and 30 seconds, depending on your network configuration. Basically, the primary should be able to contact the standby and hear back from it within this time limit.

The downside of keeping this value high is that if something goes wrong with your standby, your primary will hang for that much time.

So, in my case, if I set Net Timeout to 10 seconds and something goes wrong with the standby, my primary database will keep trying to send redo to the standby for 10 seconds, and until then commits won't complete (if I am in MaxAvailability mode).

So we need to balance the value of this parameter: high enough to ride out normal network latency, low enough to limit how long the primary stalls when the standby is unavailable.

Hope this helps!

Reference:

http://docs.oracle.com/cd/E11882_01/server.112/e17023/dbpropref.htm#i101032

Effect of Net Timeout Parameter in DG configuration