Oracle Real Application Cluster 10g


Oracle Real Application Cluster (RAC) is a major advance in database management systems. It extends the Oracle single-instance database: RAC is essentially a cluster of instances working on the same database. An instance is just memory plus a set of background processes, so in RAC we have multiple such instances, installed and configured on different nodes, all accessing a single database (one set of datafiles). This post explains the technical details of the RAC architecture and then walks through a RAC installation.

What is Oracle Real Application Cluster 10g?

Software Architecture

A RAC is a clustered database. A cluster is a group of independent servers that cooperate as a single system. In the event of a system failure, clustering ensures high availability for users, so access to mission-critical data is not lost. Redundant hardware components, such as additional nodes, interconnects and disks, avoid a single point of failure and allow the cluster to provide this high availability.


The figure above shows the architecture of RAC. In RAC, each instance runs on a separate server that can access a database made up of multiple disks. For RAC to act as a single database, each separate instance must be part of the cluster. To external users, all the instances (nodes) that are part of the cluster look like a single instance.

For each instance to be part of the cluster, cluster software must be installed, and every instance must register with it. From Oracle Database 10g onwards, Oracle provides its own clusterware, software that is installed on the nodes that are part of the cluster. The advantage of Oracle Clusterware is that the customer doesn’t have to purchase any third-party clusterware. It is also integrated with the Oracle Universal Installer (OUI) for easy installation. When a node in an Oracle cluster is started, all instances, listeners and services are started automatically. If an instance fails, the clusterware automatically restarts it, so the service is often restored before the administrator notices it was down.

Network Architecture

Each RAC node should have at least one static IP address for the public network (used by applications) and one static IP address for the private cluster interconnect. Each node can also have one virtual IP address (VIP).

The private networks are critical components of a RAC cluster. They should be used only by Oracle, to carry Cluster Manager and Cache Fusion (explained later) inter-node traffic. A RAC database does not require a separate private network, but using a public network can degrade database performance (high latency, low bandwidth). The private network should therefore have high-speed NICs (preferably one gigabit or more) and should be reserved for Oracle.

Virtual IPs are required for failover. Client connections are made using the VIPs so that, when a node fails, clients can be redirected to a surviving node. The client-side mechanism is called TAF (Transparent Application Failover). Processes external to the Oracle 10g RAC cluster control TAF, which means that the failover types and methods can be unique for each Oracle Net client.

Hardware Architecture

Both Oracle Clusterware and Oracle RAC require access to disks that are shared by each node in the cluster. The shared disks must be configured using OCFS (1 or 2), raw devices or third party cluster file system such as GPFS or Veritas.

OCFS2 is a general-purpose cluster file system that can be used to store Oracle Clusterware files, Oracle RAC database files, Oracle software, or any other types of files normally stored on a standard filesystem such as ext3. This is a significant change from OCFS Release 1, which only supported Oracle Clusterware files and Oracle RAC database files. Note that ASM cannot be used to store the Oracle Clusterware files, since the clusterware is installed before ASM and has to be started before ASM starts.

OCFS2 is available free of charge from Oracle as a set of three RPMs: a kernel module, support tools, and a console. There are different kernel module RPMs for each supported Linux kernel.

Installing RAC 10g

Installing a RAC is a 5 step process as given below.

1) Complete Pre-Installation Task
Hardware Requirement
Software Requirement
Environment Configuration, Kernel parameter and so on.
2) Perform CRS installation
3) Perform Oracle Database 10g Installation
4) Perform Cluster Database creation
5) Complete post installation task

Pre-Installation Task

Check System Requirement

– At least 512 MB of RAM
Run the command below to check:
# grep MemTotal /proc/meminfo
– At least 1 GB of swap space
Run the command below to check:
# grep SwapTotal /proc/meminfo
– At least 400 MB of space in the /tmp directory
Run the command below to check:
# df -h /tmp
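The three checks above can be combined into one small script. This is only a sketch: the function names and the MEMINFO override are my own illustrative assumptions, and the thresholds are simply the minimums listed above.

```shell
# Pre-installation sanity check for the minimums listed above.
# MEMINFO is a hypothetical override so the parser can be pointed at a test file.
MEMINFO=${MEMINFO:-/proc/meminfo}

# Print the value (in kB) of a /proc/meminfo field such as MemTotal.
meminfo_kb() {
    awk -v field="$1:" '$1 == field { print $2 }' "$MEMINFO"
}

# Succeed when the first integer is greater than or equal to the second.
at_least() {
    [ "$1" -ge "$2" ]
}

check_requirements() {
    status=0
    # MemTotal is usually a little below the installed RAM, so a machine
    # with exactly 512 MB may need a slightly lower threshold.
    at_least "$(meminfo_kb MemTotal)" $((512 * 1024)) \
        || { echo "RAM below 512 MB"; status=1; }
    at_least "$(meminfo_kb SwapTotal)" $((1024 * 1024)) \
        || { echo "swap below 1 GB"; status=1; }
    # df -Pk prints the available space in 1K blocks in column 4.
    tmp_kb=$(df -Pk /tmp | awk 'NR == 2 { print $4 }')
    at_least "$tmp_kb" $((400 * 1024)) \
        || { echo "/tmp has less than 400 MB free"; status=1; }
    return $status
}
```

Running `check_requirements && echo "System requirements met"` on each node gives a quick yes/no answer before starting the installer.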

Check Software Requirement
– package Requirements

For installing RAC, the packages required for Red Hat 3.0 are:

You can verify whether these packages (or higher versions) are installed using the following command:
# rpm -q <package_name>

– Create Groups and Users

You can create unix user groups and user IDs using groupadd and useradd commands. We need 1 oracle user and 2 groups – “oinstall” being the primary and “dba” being secondary.

# groupadd -g 500 oinstall
# groupadd -g 501 dba
# useradd -u 500 -d /home/oracle -g oinstall -G dba -m -s /bin/bash oracle

Configure Kernel Parameters

– Make sure that the following parameters are set in /etc/sysctl.conf

kernel.shmall = 2097152
kernel.shmmax = 536870912
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 658576
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default = 262144
net.core.rmem_max = 1048536
net.core.wmem_default = 262144
net.core.wmem_max = 1048536

To load the new settings, run /sbin/sysctl -p

These are the minimum required values; you can use higher values if your server configuration allows.
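To confirm that the running kernel actually meets these minimums, the live values can be compared one by one. The sketch below covers only the single-valued parameters (kernel.sem and net.ipv4.ip_local_port_range hold several numbers and are left out), and the function names are my own assumptions.

```shell
# Report whether an actual value satisfies the required minimum.
# $1 = parameter name, $2 = required minimum, $3 = actual value
compare_param() {
    if [ "$3" -ge "$2" ]; then
        echo "OK   $1 = $3"
    else
        echo "LOW  $1 = $3 (minimum $2)"
        return 1
    fi
}

# Read each live value with sysctl -n and compare it with the minimum.
# Values beyond the shell's integer range are not handled in this sketch.
check_kernel_params() {
    status=0
    while read -r name minimum; do
        compare_param "$name" "$minimum" "$(/sbin/sysctl -n "$name")" || status=1
    done <<'EOF'
kernel.shmall 2097152
kernel.shmmax 536870912
kernel.shmmni 4096
fs.file-max 658576
net.core.rmem_default 262144
net.core.rmem_max 1048536
net.core.wmem_default 262144
net.core.wmem_max 1048536
EOF
    return $status
}
```

Call `check_kernel_params` as root after running /sbin/sysctl -p; any line starting with LOW points at a parameter that still needs to be raised.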

Setting the system environment

– Set the user Shell limits

cat >> /etc/security/limits.conf << EOF
oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 1024
oracle hard nofile 65536
EOF

cat >> /etc/pam.d/login << EOF
session required /lib/security/
EOF

cat >> /etc/profile << EOF
if [ \$USER = "oracle" ]; then
    if [ \$SHELL = "/bin/ksh" ]; then
        ulimit -p 16384
        ulimit -n 65536
    else
        ulimit -u 16384 -n 65536
    fi
    umask 022
fi
EOF

cat >> /etc/csh.login << EOF
if ( \$USER == "oracle" ) then
    limit maxproc 16384
    limit descriptors 65536
    umask 022
endif
EOF

– Configure the Hangcheck Timer

The hangcheck-timer module monitors the Linux kernel for extended operating system hangs that can affect the reliability of a RAC node and cause database corruption. If a hang occurs, the module reboots the node.

You can check if the hangcheck-timer module is loaded by running lsmod command as root user.

/sbin/lsmod | grep -i hang

If the module is not loaded, you can load it manually with the command below, and add the same command to /etc/rc.d/rc.local so that it is loaded on every boot.

modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
cat >> /etc/rc.d/rc.local << EOF
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
EOF

– Configuring /etc/hosts

/etc/hosts contains the hostname and IP address of the server.

You will need 3 hostnames for each node in the cluster: a public hostname for the primary interface, a private hostname for the cluster interconnect, and a virtual hostname (VIP) for high availability.

For Node 1

# Do not remove the following line, or various programs
# that require network functionality will fail.
localhost.localdomain localhost
ocvmrh2045 #node1 public
ocvmrh2045-nfs ocvmrh2045-a #node1 nfs
ocvmrh2045-priv #node1 private
ocvmrh2053-priv #node2 private
ocvmrh2051 # Node1 vip
ocvmrh2056 # Node2 vip

For Node 2

# Do not remove the following line, or various programs
# that require network functionality will fail.
localhost.localdomain localhost
ocvmrh2053 # Node2 Public
ocvmrh2053-nfs ocvmrh2053-a # Node2 nfs
ocvmrh2045-priv # Node1 Private
ocvmrh2053-priv # Node2 Private
ocvmrh2051 # Node1 vip
ocvmrh2056 # Node2 vip

– Creating database Directories

The following directories must be created, with write permission for the oracle user.

Oracle Base Directories
Oracle Inventory Directory
CRS Home Directory
Oracle Home Directory

In our case the directories are:

Oracle Base Directories – /u01/app/
Oracle Inventory Directory – /u01/app/oraInventory
CRS Home Directory – /u01/app/oracle/product/10.2.0/crs
Oracle Home Directory – /u01/app/oracle/product/10.2.0/db
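The layout above could be created with a few lines of shell. This is a sketch rather than a required procedure: the function name and the optional prefix argument are my own additions (the prefix lets you try it in a scratch directory first), and the 775 mode is an assumption that mirrors what is used for the shared directories later in this post.

```shell
# Create the directory layout described above beneath an optional prefix.
# Call with no argument to create the real paths under /.
make_oracle_dirs() {
    prefix=${1:-}
    for dir in \
        /u01/app \
        /u01/app/oraInventory \
        /u01/app/oracle/product/10.2.0/crs \
        /u01/app/oracle/product/10.2.0/db
    do
        mkdir -p "$prefix$dir" || return 1
    done
    # Hand the tree to the oracle user only when running as root
    # and the oracle account already exists.
    if [ "$(id -u)" -eq 0 ] && id oracle >/dev/null 2>&1; then
        chown -R oracle:oinstall "$prefix/u01/app"
        chmod -R 775 "$prefix/u01/app"
    fi
}
```

Run `make_oracle_dirs` as root on every node, since these homes are local to each server.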

Configure SSH for User Equivalence

The OUI detects whether the machine on which you are installing RAC is part of a cluster. If it is, you select the other cluster nodes on which you want to install the software. When OUI connects from the first node to install on another node, it would normally ask for login credentials and prompt for a password in the middle of the installation, which we want to avoid. For this purpose we need user equivalence in place. User equivalence can be achieved using SSH, so first you have to configure SSH.

Log on as the “oracle” UNIX user

# su - oracle

If necessary, create the .ssh directory in the “oracle” user’s home directory and set the correct permissions on it:

$ mkdir -p ~/.ssh
$ chmod 700 ~/.ssh

Enter the following command to generate an RSA key pair (public and private key) for version 2 of the SSH protocol:

$ /usr/local/git/bin/ssh-keygen -t rsa

Enter the following command to generate a DSA key pair (public and private key) for version 2 of the SSH protocol:

$ /usr/local/git/bin/ssh-keygen -t dsa

Repeat the above steps for all Oracle RAC nodes in the cluster

Create authorized key file.

$ touch ~/.ssh/authorized_keys
$ cd ~/.ssh
bash-3.00$ ls -lrt *.pub
-rw-r--r-- 1 oracle oinstall 399 Nov 20 11:51
-rw-r--r-- 1 oracle oinstall 607 Nov 20 11:51

The listing above should show the RSA and DSA public keys created in the previous section.

In this step, use SSH to copy the content of the RSA and DSA public key files in ~/.ssh/ from all Oracle RAC nodes in the cluster to the authorized key file just created (~/.ssh/authorized_keys).

Here node 1 is ocvmrh2045 and node 2 is ocvmrh2053

ssh ocvmrh2045 cat ~/.ssh/ >> ~/.ssh/authorized_keys
ssh ocvmrh2045 cat ~/.ssh/ >> ~/.ssh/authorized_keys
ssh ocvmrh2053 cat ~/.ssh/ >> ~/.ssh/authorized_keys
ssh ocvmrh2053 cat ~/.ssh/ >> ~/.ssh/authorized_keys

Now that this file contains all the public keys from both nodes, we should copy it to all the RAC nodes. We don’t have to repeat the steps above on every node; just copying the file to the other nodes will do.

scp ~/.ssh/authorized_keys ocvmrh2053:.ssh/authorized_keys

Set permissions on the authorized_keys file

chmod 600 ~/.ssh/authorized_keys
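Wrong permissions are a common reason for SSH to keep prompting for a password, so it can be worth checking them on every node. The sketch below assumes GNU stat (as shipped with Linux); the function names are hypothetical.

```shell
# Print the octal permission bits of a path (GNU stat syntax).
perm_of() {
    stat -c '%a' "$1"
}

# Check an SSH directory: 700 on the directory, 600 on authorized_keys.
check_ssh_perms() {
    dir=${1:-$HOME/.ssh}
    [ "$(perm_of "$dir")" = "700" ] \
        || { echo "$dir should be 700"; return 1; }
    [ "$(perm_of "$dir/authorized_keys")" = "600" ] \
        || { echo "$dir/authorized_keys should be 600"; return 1; }
    echo "permissions OK"
}
```

Run `check_ssh_perms` as the oracle user on each node; with no argument it inspects ~/.ssh.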

Establish User Equivalence

Once SSH is configured we can go ahead with configuring user equivalence.

su - oracle

exec /usr/local/git/bin/ssh-agent $SHELL
$ /usr/local/git/bin/ssh-add

Identity added: /home/oracle/.ssh/id_rsa (/home/oracle/.ssh/id_rsa)
Identity added: /home/oracle/.ssh/id_dsa (/home/oracle/.ssh/id_dsa)

– Test Connectivity

Try the commands below; they should not ask for a password. They might ask for one the first time, but after that they should run without prompting.

ssh ocvmrh2045 "date;hostname"
ssh ocvmrh2053 "date;hostname"
ssh ocvmrh2045-priv "date;hostname"
ssh ocvmrh2053-priv "date;hostname"
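The same test can be looped over all hostnames so that a missing key shows up as a failure instead of a password prompt. This is a sketch: `check_equivalence` and the SSH_CMD override are names I made up, while BatchMode=yes is a standard OpenSSH option that disables password prompting.

```shell
# SSH_CMD is a hypothetical override so the loop can be dry-run with a stub.
SSH_CMD=${SSH_CMD:-ssh}

# Try a passwordless command on every host given as an argument.
check_equivalence() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if $SSH_CMD -o BatchMode=yes "$host" "date;hostname" >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}
```

For this cluster the call would be `check_equivalence ocvmrh2045 ocvmrh2053 ocvmrh2045-priv ocvmrh2053-priv`. A host whose key has not yet been accepted is also reported as failed, so connect to each name once by hand first.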

Partitioning the disk

In order to use OCFS2, you first need to partition the unused disk. You can run /sbin/sfdisk -s as the root user to check the partitions. We will create a single partition to be used by OCFS2. As the root user, run the command below.

# fdisk /dev/sdc
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won’t be recoverable.

The number of cylinders for this disk is set to 1305.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
e extended
p primary partition (1-4)
Partition number (1-4): 1
First cylinder (1-1305, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-1305, default 1305):
Using default value 1305

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

You can verify the new partition now as

# fdisk -l /dev/sdc

Disk /dev/sdc: 10.7 GB, 10737418240 bytes
255 heads, 63 sectors/track, 1305 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sdc1 1 1305 10482381 83 Linux

When finished partitioning, run the partprobe command as root on each of the remaining cluster nodes to ensure that the new partitions are recognized.

Configure OCFS2

We will be using OCFS2 in this installation. OCFS is a cluster file system provided by Oracle and designed specifically for RAC installations. Once the disk is configured for OCFS2, we can use it for the clusterware files (the OCR, or Oracle Cluster Registry, file and the voting file) as well as for the database files.

# ocfs2console

Select Cluster -> Configure Nodes

Click on Add on the next window, and enter the Name and IP Address of each node in the cluster.
Note: Use node name to be the same as returned by the ‘hostname’ command


Apply, and Close the window.

After exiting the ocfs2console, you will have a /etc/ocfs2/cluster.conf similar to the following on all nodes. This OCFS2 configuration file should be exactly the same on all of the nodes:

node:
    ip_port = 7777
    ip_address =
    number = 0
    name = ocvmrh2045
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address =
    number = 1
    name = ocvmrh2053
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2

Configure O2CB to Start on Boot and Adjust O2CB Heartbeat Threshold

You now need to configure the on-boot properties of the O2CB driver so that the cluster stack services will start on each boot. You will also be adjusting the OCFS2 Heartbeat Threshold from its default setting of 7 to 601. All the tasks within this section will need to be performed on both nodes in the cluster as root user.

Set the on-boot properties as follows:

# /etc/init.d/o2cb offline ocfs2
# /etc/init.d/o2cb unload
# /etc/init.d/o2cb configure

Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on boot. The current values will be shown in brackets (‘[]’). Hitting <ENTER> without typing an answer will keep that current value. Ctrl-C will abort.

Load O2CB driver on boot (y/n) [y]: y
Cluster to start on boot (Enter “none” to clear) [ocfs2]: ocfs2
Specify heartbeat dead threshold (>=7) [7]: 601
Writing O2CB configuration: OK
Loading module “configfs”: OK
Mounting configfs filesystem at /config: OK
Loading module “ocfs2_nodemanager”: OK
Loading module “ocfs2_dlm”: OK
Loading module “ocfs2_dlmfs”: OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK

We can now verify that the settings took effect for the o2cb cluster stack. Check that ocfs2 and o2cb are started on both nodes. As the root user:

# chkconfig --list | egrep "ocfs2|o2cb"
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off
o2cb 0:off 1:off 2:on 3:on 4:on 5:on 6:off

If the output doesn’t look like the above on both nodes, turn the services on with the following commands as root:

# chkconfig ocfs2 on
# chkconfig o2cb on

Create and format the OCFS2 filesystem on the unused disk partition

As root on each of the cluster nodes, create the mount point directory for the OCFS2 file system.

# mkdir /u03

Run the command below as the root user on one node only to create an OCFS2 file system on the unused partition /dev/sdc1 that you created above.

# mkfs.ocfs2 -b 4K -C 32K -N 4 -L /u03 /dev/sdc1
mkfs.ocfs2 1.2.2
Filesystem label=/u03
Block size=4096 (bits=12)
Cluster size=32768 (bits=15)
Volume size=10733944832 (327574 clusters) (2620592 blocks)
11 cluster groups (tail covers 5014 clusters, rest cover 32256 clusters)
Journal size=67108864
Initial number of node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Formatting Journals: done
Writing lost+found: done
mkfs.ocfs2 successful

The command above formats the partition with a volume label of “/u03” (-L /u03), a block size of 4K (-b 4K) and a cluster size of 32K (-C 32K), with 4 node slots (-N 4).
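As a quick cross-check, the cluster and block counts that mkfs.ocfs2 printed follow directly from the volume size and the -C and -b values. The variable names below are only for illustration.

```shell
# Reproduce the cluster and block counts reported by mkfs.ocfs2 above.
volume_bytes=10733944832           # "Volume size" from the mkfs.ocfs2 output
cluster_bytes=$((32 * 1024))       # -C 32K
block_bytes=$((4 * 1024))          # -b 4K

clusters=$((volume_bytes / cluster_bytes))
blocks=$((volume_bytes / block_bytes))

echo "clusters=$clusters blocks=$blocks"
# prints: clusters=327574 blocks=2620592
```

Both numbers match the "Volume size" line in the output, and each 32K cluster holds eight 4K blocks.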

Once OCFS2 filesystem is configured on the disk, you can mount the same.

Mount OCFS2 filesystem on both nodes

Run the below command on all nodes to mount the disk having OCFS2 file system.

# mount -t ocfs2 -L /u03 -o datavolume /u03

You can verify if the disk is mounted correctly or not using below command

# df /u03

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc1 10482368 268736 10213632 3% /u03

Create the directories for shared files

As root user, run the following commands on node1 only. Since /u03 is on a shared disk, all the files added from one node will be visible on other nodes.

CRS files:

# mkdir /u03/oracrs
# chown oracle:oinstall /u03/oracrs
# chmod 775 /u03/oracrs

Database files:

# mkdir /u03/oradata
# chown oracle:oinstall /u03/oradata
# chmod 775 /u03/oradata

Installing Oracle Clusterware

Before installing the Oracle RAC database, we need to install Oracle Clusterware. Clusterware creates two important files: the OCR (Oracle Cluster Registry) file and the voting file. The OCR file registers the nodes involved in the RAC installation and stores all the details about those nodes. The voting file is used to track the status of each node: each node registers its presence in the voting file at a fixed interval. This is called the heartbeat of RAC. If a node goes down, it can no longer register its presence in the voting file, so the other instances come to know about it. Another instance will then bring up the crashed instance.

Follow the steps below to install the clusterware.

From the setup directory, run the ./runInstaller command.

Below are the main screens and the inputs to be given.

Welcome page – Click Next


Specify Inventory Directory and Credentials – Enter the inventory location where it should create inventory


Specify Home Details – Provide the location of the CRS home. Note that this does not have to be a shared location: it is where the CRS software is installed, not where the OCR and voting files are stored.

Product Specific Prerequisite Checks – OUI will perform the required pre-reqs checks. Once done, press Next.


Specify Cluster Configuration – On this screen you need to add all the servers that will be part of the RAC installation. Basically this is a push install, where the installation is pushed to all the nodes we add here, so that we don’t have to install CRS again from node 2.

Specify Network Interface Usage – We need at least 1 network to be private and not to be used by application. So make 1 network as private network, so that we can use the same for interconnect.

Specify OCR Location – This is where you will provide the location for OCR file. Remember that this file should be shared and accessible to all the nodes. We know that we have a shared disk /u03. In the above step under “Create the directories for shared files”, we created a “/u03/oracrs” directory. This can be provided here.

Specify Voting Disk Location – On this screen you will provide the location for voting file. You need to provide the shared location here as well. You can provide the same shared location we created in above step.


Summary – Click on Install

You can verify the cluster installation by running olsnodes.

bash-3.00$ /u01/app/oracle/product/10.2.0/crs/bin/olsnodes

Create the RAC Database

You can follow the same steps as for installing a single-instance database; only a couple of screens are new in this installation compared to a single-instance installation.


4) Specify Hardware Cluster Installation Mode – Select cluster installation and click Select All to select all the nodes in the cluster. This will propagate the installation to all the nodes.


10) Specify Database Storage Options – If you are using a file system rather than ASM or raw devices, specify the shared location we created above. This is important because all the instances need access to the datafiles: we are creating multiple instances but a single database (one set of database files).

At the end, it will give the summary and you click on install.

Congratulations! Your new Oracle 10g RAC Database is up and ready for use.





Oracle RAC 10g – Cache Fusion


This post is about Oracle Cache Fusion technology, which is implemented in Oracle Database 10g RAC. We are going to discuss only Cache Fusion, so you should already know the RAC architecture. Please check the Oracle documentation to understand the Oracle RAC architecture. You can also visit my previous post about Oracle RAC installation for some basic information and installation details.

Cache Fusion technology was partially implemented in Oracle 8i in OPS (Oracle Parallel Server). Before Oracle 8i the situation was different. In a multi-instance Oracle Parallel Server, if one instance asked for a block of data that was currently modified by another instance of the same database, the holding instance had to write the block to disk so that the requesting instance could read it. This is called a “disk ping”, and it greatly affected the performance of the database. With Oracle 8i, partial Cache Fusion was implemented.

Oracle 8i (Oracle Parallel Server) has a background process called the “Block Server Process”, which was responsible for Cache Fusion in Oracle 8i OPS. The following table gives the scenarios in which Cache Fusion worked in Oracle 8i OPS and those in which it did not. Of course, these limitations are not present in Oracle 10g RAC.

So when a requesting instance asked for a block that was present in the holding instance in read or write mode, and the block was dirty, Cache Fusion worked and the block was copied from the cache of the holding instance to the requesting instance. But if the block was not dirty and was present in the holding instance, the requesting instance had to read the block from the datafile. Also, if the block was opened for write in the holding instance and another instance wanted to update the same block, the holding instance had to write the block to disk so that the requesting instance could read it.

Concept of cache fusion

Cache Fusion is basically about fusing the buffer caches of multiple instances into one single cache. For example, if we have 3 instances in a RAC using the same datafiles, each with its own buffer cache in its own SGA, Cache Fusion makes the database behave as if it had a single instance whose total buffer cache is the sum of the buffer caches of all 3 instances. The figure below shows what I mean.

This behavior is possible because of the high-speed interconnect that exists in the cluster between the instances. Each instance is connected to the other instances by this interconnect, which makes it possible to share memory between 2 or more servers. Previously only datafile sharing was possible; now, because of the interconnect, even the cache memory can be shared.

But how does this help? Suppose one instance holds a data block and is updating it, and another instance needs the same data block. The block can then be copied from the holding instance’s buffer cache to the requesting instance’s buffer cache over the high-speed interconnect. The interconnect is a private connection used only by the instances, for sending data blocks and other messages; external users cannot use it. It is this interconnect that makes multiple servers behave like a cluster. The servers are bound together by it.

Moving on: now we know how the cluster is formed, what the backbone of the cluster is, and what exactly we call “cache fusion”. Next we will see how Cache Fusion works. But before that we need to discuss a few topics that are important to understand.

We will discuss following topics before discussing Cache Fusion

  1. Cache Coherency
  2. Multi-Version consistency model
  3. Resource Co-ordination – Synchronization
  4. Global Cache Service (GCS)
  5. Global Enqueue Service
  6. Global Resource Directory
  7. GCS resource modes and roles
  8. Past Images
  9. Block access modes and buffer states

I promise this won’t be too heavy. Let’s look at an overview of these concepts. I won’t go into the details, just enough for you to understand Cache Fusion.

1) Cache Coherency

If we consider a single-instance database, whenever a user queries for data he gets a consistent view of the data. For example, suppose one user has already read a block of data and changed some rows in the buffer cache. If another user wants to read data from the same block, Oracle makes a copy of that block in the buffer cache and applies the undo information from the undo tablespace to get a consistent view of the data. This consistent data is then presented to the user who wants to read it. This is called maintaining consistency of data.
Now consider a multi-instance RAC system, where a data block might not be present in the local instance. A user might be updating the block in some other instance. If the data blocks are already available in the local instance, they are immediately available to the user. If they are present in some other instance within the cluster, they are transferred into the local buffer cache.
Maintaining the consistency of data blocks in the buffer caches of multiple instances is called “Cache Coherency”.

2) Multi-Version consistency model

The multi-version consistency model distinguishes between the current version of a data block and one or more read-consistent versions of it. The current block is the one that contains all the changes, committed as well as uncommitted. For example, a user fires a DML statement on a data block that is not present in any of the instances. The block is read from disk into the buffer cache, where the value gets changed. The user then commits and fires another DML statement on the same data block. Now that data block is dirty and contains committed as well as uncommitted changes.
Suppose this data block is requested by another user for reading. Oracle makes a copy, applies the undo information to build a Consistent Read (“CR”) copy of the block, and ships it to the requesting instance. Thus we have multiple versions of the same data block, each of them consistent with respect to the user who requested it.
During the course of operation there can be many more versions of the same data block, each of them consistent with respect to some point in time.

3) Resource Co-ordination – Synchronization

In a multi-instance system such as RAC, where the same resources (for example, data blocks) are used concurrently, effective synchronization is required to maintain consistency. Within the shared cache, coordination of concurrent tasks is called synchronization. The synchronization provided by Oracle RAC gives cluster-wide concurrency control over resources and in turn ensures the integrity of shared data. Although there is synchronization within the cache, it comes at a cost. At the lowest level, synchronization is just a data copy or data transfer operation.
According to Oracle studies, accessing a block in the local cache is much faster than accessing it from another instance’s cache within the cluster. With the local cache it is an in-memory copy, whereas with another instance’s cache the data has to be transferred over the high-speed interconnect, which is obviously slower than an in-memory copy. Slowest of all is a read from disk, which is much slower than the other two. The graph below shows the block access time using these 3 methods.

4) Global Cache Service

Global Cache Service (GCS) is the main component of Oracle Cache Fusion technology. It is represented by the background processes LMSn. There can be at most 10 LMS processes per instance. The main function of GCS is to track the status and location of data blocks. The status of a data block means its mode and role (I will explain mode and role below). GCS is the main mechanism by which cache coherency among multiple caches is maintained. GCS is also responsible for block transfers between the instances.

5) Global Enqueue Service

Global Enqueue Service (GES) tracks the status of all Oracle enqueuing mechanisms. This covers all non-Cache Fusion inter-instance operations. GES performs concurrency control on dictionary cache locks, library cache locks and transactions. It performs this operation for resources that are accessed by more than one instance.
Enqueue services are also present in a single-instance database. They are responsible for locking the rows of a table using different locking modes. To understand more about enqueues, check the Oracle documentation about locking.

6) Global Resource Directory

GES and GCS together maintain the Global Resource Directory (GRD). The GRD is like an in-memory database that contains details about all the blocks present in the caches. It knows the location of the latest version of a block, the mode of the block, the role of the block (mode and role will be discussed shortly), and so on. Whenever a user asks for a data block, GCS gets all the information from the GRD. The GRD is a distributed resource, meaning that each instance maintains some part of it. This distributed nature is key to the fault tolerance of RAC. The GRD is stored in the SGA. Typically the GRD contains the following information, among other things:

  • Data Block Address – This is the address of data block being modified
  • Location of most current version of data block
  • Modes of data block
  • Roles of data block
  • SCN number of data block
  • Image of data block – Could be current image or past image.

7) GCS resource modes and roles

Mode of data block is decided based on whether a resource holder intends to modify the data or read the data. The modes are as follows:

  1. Null (N) Mode: Null mode is the least restrictive mode. It indicates no access rights and acts as a placeholder.
  2. Shared (S) Mode: Shared mode indicates that the data block is being read and not modified. Other sessions can still read the data block.
  3. Exclusive (X) Mode: Exclusive mode indicates exclusive access to the block. Other resources cannot write to this data block, but they can have a consistent read on it.

GCS resources also have roles. The different roles are as follows:

  1. Local: When a data block is first read into an instance from disk, it has a local role, meaning that only 1 copy of the data block exists in the cache. No other instance’s cache has a copy of this block.
  2. Global: Global role indicates that multiple copies of the data block exist across the clustered instances. For example, a user connected to one of the instances requests a data block. The block is read from disk into that instance, and the role granted is local. If another instance then requests the same block, it is copied to the requesting instance and the role becomes global.

This role and mode information is maintained in GRD (Global Resource Directory) by GCS (Global Cache Service).
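The modes and roles above can be written down as two small enumerations, together with a check for which pairs of modes two different instances can hold on the same block at the same time. This is a sketch based only on the descriptions in this section; the names Mode, Role and can_coexist are made up for the illustration.

```python
from enum import Enum


class Mode(Enum):
    NULL = "N"       # placeholder, no access rights
    SHARED = "S"     # block is being read, not modified
    EXCLUSIVE = "X"  # holder may modify the block


class Role(Enum):
    LOCAL = "L"      # the resource is managed within one instance
    GLOBAL = "G"     # the resource spans several instances


def can_coexist(a: Mode, b: Mode) -> bool:
    """True if two different instances may hold modes a and b together."""
    if Mode.NULL in (a, b):
        return True  # a null holder has no access rights to conflict with
    # Of the real modes, only shared/shared is compatible.
    return a is Mode.SHARED and b is Mode.SHARED
```

For example, can_coexist(Mode.SHARED, Mode.SHARED) is true because any number of readers is allowed, while any pairing that includes a second non-null holder next to EXCLUSIVE is rejected.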

8) Past Images

The Past Image concept was introduced in Oracle 9i to maintain data integrity. In an Oracle database, a block is typically not written to disk immediately after it is dirtied. This reduces excessive I/O. When the same dirty block is requested by another instance for write or read purposes, an image of the block is created in the owning instance and the block is then shipped to the requesting instance. This image copy of the block is called a Past Image (PI). In the event of failure, Oracle can reconstruct the block by reading PIs. It is also possible to have more than one PI of a block, depending on how many times the block was requested while dirty.

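A past image of a block is different from a CR (consistent read) image. A CR image is created by applying undo data to a copy of the block, whereas a past image is a copy of a dirty block kept so that the block can be reconstructed after a failure.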

9) Block access modes and buffer states

An additional concurrency control concept is the buffer state, which is the state of a buffer in the local cache of an instance. The buffer state of a block relates to the access mode of the block. For example, if a buffer state is exclusive current (XCUR), an instance owns the resource in exclusive mode.
To see a buffer’s state, query the STATUS column of the V$BH dynamic performance view. The V$BH view provides information about the block access modes and their buffer state names as follows:

  • With a block access mode of NULL the buffer state name is CR: an instance can perform a consistent read of the block, that is, if the instance holds an older version of the data.
  • With a block access mode of S the buffer state name is SCUR: an instance has shared access to the block and can only perform reads.
  • With a block access mode of X the buffer state name is XCUR: an instance has exclusive access to the block and can modify it.
  • With a block access mode of NULL the buffer state name is PI: an instance has made changes to the block but retains copies of it as past images to record its state before changes.

Only the SCUR and PI buffer states are specific to Real Application Clusters. There can be only one copy of any one block buffered in the XCUR state in the cluster database at any time. To modify a block, a process must assign an XCUR buffer state to the buffer containing the data block.
For example, if another instance requests read access to the most current version of the same block, then Oracle changes the access mode from exclusive to shared, sends a current read version of the block to the requesting instance, and keeps a PI buffer if the buffer contained a dirty block.

At this point, the first instance has the current block and the requesting instance also has the current block in shared mode. Therefore, the role of the resource becomes global. There can be multiple shared current (SCUR) versions of this block cached throughout the cluster database at any time.
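The mapping between buffer state names and block access modes described in this section can be captured in a small lookup table. The sketch below is plain Python over status strings. It does not connect to a database, and the helper names are made up for this illustration. It normalizes case, so either spelling of a status value works.

```python
# Buffer state name -> block access mode, as described in this section.
ACCESS_MODE_BY_BUFFER_STATE = {
    "CR": "NULL",
    "SCUR": "S",
    "XCUR": "X",
    "PI": "NULL",
}


def access_mode(status):
    """Return the access mode for a buffer state name such as 'xcur'."""
    return ACCESS_MODE_BY_BUFFER_STATE[status.strip().upper()]


def can_modify(status):
    """Only a buffer held in exclusive mode (XCUR) can be modified."""
    return access_mode(status) == "X"


def single_xcur_rule_holds(status_by_instance):
    """Check that at most one instance holds the block in XCUR state."""
    xcur_holders = [
        inst for inst, status in status_by_instance.items()
        if status.strip().upper() == "XCUR"
    ]
    return len(xcur_holders) <= 1
```

Passing a dictionary such as {1: "SCUR", 2: "SCUR", 3: "PI"} to single_xcur_rule_holds satisfies the rule, while two XCUR entries for the same block would violate it.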

Block transfer using Cache Fusion

Let's consider a very detailed example of how block transfer happens between different instances. For this example I am assuming a 3 node RAC system, and I am also assuming that every DML statement is followed by a commit. So if I say that a user executed an update, that means the user executed update + commit. However, there is no checkpoint until the end.

Stage 1) In stage 1 a data block is requested by user C, who is connected to instance 3. The data block is read from disk into the buffer cache of instance 3.

select sales_rank from salesman where salesid = 10;

Assume this gives a value of 30. This block is read for the first time and it is not present in any other instance. So the role of the block is LOCAL and the block is read in SHARED mode. There are also NO PAST IMAGES. So we describe this stage as instance 3 holding the block in SL0 mode (SHARED, LOCAL, 0 PAST IMAGES).

Stage 2) In stage 2 user B issues the same select statement against the salesman table. Instance 2 needs the same block; therefore, the block is shipped from instance 3 to instance 2 over the Cache Fusion interconnect. There is no disk read at this time. Both instances hold the block in SHARED mode (S) with a LOCAL role (L). Note that even though the block is present in more than one instance, we still say the role is local because the block has not yet been dirtied. Had the block been dirty when the other instance requested it, the role would have changed to global.

Stage 3) In stage 3 user B decides to update the row and commit at instance 2. The new sales rank is 24. At this stage, instance 2 acquires an EXCLUSIVE lock to update the data, and the SHARED lock on instance 3 is downgraded to a NULL lock.

update salesman set sales_rank = 24 where salesid = 10; commit;

So instance 2 now holds the block in XL0 mode (Exclusive, Local with 0 past images) and instance 3 holds a NULL lock, which is just a placeholder. The role of the block is still LOCAL because the block has been dirtied for the first time, only on instance 2, and no other instance holds a dirty copy of it. If another instance now tries to update the same block, the role will change to global.

Stage 4) In stage 4 user A on instance 1 decides to update the same row, and hence the same block, with a sales rank of 40. Instance 1 finds that the block is dirty in instance 2. Therefore the data block is shipped from instance 2 to instance 1. However, a PAST IMAGE of the data block is created on instance 2, and the lock mode on instance 2 is downgraded to NULL with a GLOBAL role. Instance 2 now has NG1 (NULL lock with GLOBAL role and 1 PAST IMAGE). At this time instance 1 holds an EXCLUSIVE lock with GLOBAL role (XG0).

Stage 5) User C executes a select statement from instance 3 on the same row. The copy of the data block on instance 1 is the most recent one (the GRD knows which instance holds the latest copy of a data block), so it is shipped to instance 3. As a result the lock on instance 1 is converted to SHARED GLOBAL with 1 PAST IMAGE (SG1). The lock is changed to SHARED and not NULL because instance 3 asked for a shared lock (for reading data) and not an exclusive lock (for updating data). If instance 3 had asked for an exclusive lock, instance 1 would have ended up with a NULL lock.

Instance 3 now holds SG0 (SHARED, GLOBAL with 0 PAST IMAGES).

Stage 6) User B issues the same select statement against the salesman table on instance 2. Instance 2 requests a consistent copy of the buffer from another instance, which happens to be the current master.
Therefore instance 1 ships the block to instance 2, where it is acquired as SG1 (SHARED, GLOBAL with 1 PAST IMAGE), because instance 2 still holds the past image created in stage 4. So the instance 2 mode becomes SG1.

Stage 7) User C on instance 3 updates the same row. Instance 3 therefore requires an exclusive lock, and instance 1 and instance 2 are downgraded to a NULL lock with GLOBAL role and 1 PAST IMAGE (NG1). Instance 3 holds an EXCLUSIVE lock with GLOBAL role and no PAST IMAGES (XG0).

Stage 8) A checkpoint is initiated and a “Write to Disk” takes place at instance 3. As a result the previous past images are discarded (as they are no longer required for recovery), and instance 3 holds the block with an EXCLUSIVE lock, LOCAL role and no PAST IMAGES (XL0).

After this, if any instance wants to read or write the same block, a copy will again be shipped from instance 3.
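The eight stages above can be replayed with a small toy model that tracks, for each instance, the lock mode, the role and the number of past images, and prints them in the same three-character notation (SL0, XG0, NG1 and so on). This is a simplified sketch written for this walkthrough, not how Oracle implements Cache Fusion. The class and method names are made up, and the states the text does not spell out (for example the role of a NULL holder after the checkpoint) are assumptions of the model.

```python
class CachedBlock:
    """Toy model of one data block's lock state on each instance."""

    def __init__(self, instances):
        self.state = {i: {"mode": "N", "role": "L", "pi": 0} for i in instances}
        self.dirty = False  # True once changed and not yet written to disk

    def label(self, inst):
        s = self.state[inst]
        return f"{s['mode']}{s['role']}{s['pi']}"

    def _give_up(self, inst, new_mode):
        # A holder is downgraded because another instance needs the block.
        s = self.state[inst]
        if self.dirty:
            s["role"] = "G"
            if s["mode"] == "X":
                s["pi"] += 1  # exclusive holder of a dirty block keeps a PI
        s["mode"] = new_mode

    def read(self, inst):
        me = self.state[inst]
        if me["mode"] != "N":
            return  # already holds a current copy
        for other, s in self.state.items():
            if s["mode"] == "X":
                self._give_up(other, "S")
        me["mode"] = "S"
        if self.dirty:
            me["role"] = "G"  # a dirty block was shipped between instances

    def write(self, inst):
        for other, s in self.state.items():
            if other != inst and s["mode"] != "N":
                self._give_up(other, "N")
        me = self.state[inst]
        me["mode"] = "X"
        if self.dirty:
            me["role"] = "G"
        self.dirty = True

    def checkpoint(self):
        # The current block is written to disk; past images are discarded.
        for s in self.state.values():
            s["pi"] = 0
            s["role"] = "L"  # assumption: every holder returns to local role
        self.dirty = False


def walkthrough():
    """Replay stages 1 to 8 and return the labels after each stage."""
    block = CachedBlock([1, 2, 3])
    stages = [
        lambda: block.read(3),     # stage 1: user C selects on instance 3
        lambda: block.read(2),     # stage 2: user B selects on instance 2
        lambda: block.write(2),    # stage 3: user B updates on instance 2
        lambda: block.write(1),    # stage 4: user A updates on instance 1
        lambda: block.read(3),     # stage 5: user C selects on instance 3
        lambda: block.read(2),     # stage 6: user B selects on instance 2
        lambda: block.write(3),    # stage 7: user C updates on instance 3
        lambda: block.checkpoint(),  # stage 8: checkpoint at instance 3
    ]
    snapshots = {}
    for number, action in enumerate(stages, start=1):
        action()
        snapshots[number] = {i: block.label(i) for i in (1, 2, 3)}
    return snapshots
```

Calling walkthrough() returns a dictionary keyed by stage number, so walkthrough()[4] gives the labels of all three instances after stage 4 and can be compared against the description of that stage.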


Oracle 10g Grid and Real Application Cluster – By Madhu Tumma