Converting MySQL database character set to UTF8

Recently I was engaged to convert the character set of a few MySQL databases from latin1 to UTF8 (collation utf8_general_ci). This article describes the approach I took.

First I will describe the various levels at which we can change the character set and collation, and then we will see how to convert existing data in a database to the required character set.

Back up your database:

Before even thinking about changing the character set of your database, take a backup using whatever backup method you have tested – mysqldump, mysql-enterprise-backup, export, etc.

In case something goes wrong, we will still have the data and can recreate the required tables/databases.

Make sure your backup and restore methods are proven, meaning that you have successfully restored tables/databases from them before.

Setting character set at various levels:

We can see the following character set parameters:

root [mysql] >show variables like '%character%set%';
+--------------------------+---------------------------------------------------------------+
| Variable_name            | Value                                                         |
+--------------------------+---------------------------------------------------------------+
| character_set_client     | utf8                                                          |
| character_set_connection | utf8                                                          |
| character_set_database   | latin1                                                        |
| character_set_filesystem | binary                                                        |
| character_set_results    | utf8                                                          |
| character_set_server     | latin1                                                        |
| character_set_system     | utf8                                                          |
| character_sets_dir       | /usr/local/mysql-5.6.16-linux-glibc2.5-x86_64/share/charsets/ |
+--------------------------+---------------------------------------------------------------+
8 rows in set (0.00 sec)

character_set_client – The character set for statements that arrive from the client. If the client does not set a character set while connecting, this character set is used for statements sent by the client; otherwise, the value set by the client during connection overrides it.

character_set_connection – The character set used for literals that do not have a character set introducer and for number-to-string conversion.

character_set_database – The character set used by the default database. This character set is used whenever we change the database on the server, if that database does not have a character set of its own.

character_set_filesystem – This character set is used to interpret string literals that refer to file names at file system level, such as in the LOAD DATA INFILE and SELECT … INTO OUTFILE statements.

character_set_results – The character set used for returning query results, such as result sets. If the client has set a character set on its connection, this value is not used for returning results.

character_set_server – The character set defined at the server level. Any new database created will use this character set, unless we define a character set at the database level.

character_set_system – The character set used by the server for storing metadata information. For example, the return values of USER(), CURRENT_USER(), SESSION_USER(), SYSTEM_USER(), DATABASE(), VERSION(), etc. are returned in the character set assigned to this variable.

character_sets_dir – The directory where all character sets are installed.
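As a quick sketch of how the session-level variables interact, a client can change the client/connection/results trio in one statement:

```sql
-- Setting the connection character set from the client side.
-- SET NAMES changes three session variables at once:
SET NAMES utf8;

-- which is equivalent to:
-- SET character_set_client     = utf8;
-- SET character_set_connection = utf8;
-- SET character_set_results    = utf8;

-- Verify the session-level values:
SHOW VARIABLES LIKE 'character_set_%';
```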

We can set the character set at the following levels:

Server Level:

We can do this by setting the parameter character_set_server in the main my.cnf file, but this needs a bounce. Once the MySQL server is bounced, it picks up the new value of this parameter and that becomes the default character set. This does not change anything in existing data or objects; only newly created databases pick up the new setting.
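As a sketch, the relevant my.cnf entries could look like below (section name as in a typical installation):

```ini
# my.cnf – server-wide defaults, picked up on the next restart (bounce)
[mysqld]
character-set-server = utf8
collation-server     = utf8_general_ci
```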

Database Level:

We can alter any database on our MySQL server and change its character set to UTF8, using the following command:

ALTER DATABASE <db_name> DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Again, this does not affect existing data or objects. It only takes effect for future objects/tables that we create in this database.

Table Level:

We can use the ALTER TABLE command to set the character set for a table.

ALTER TABLE <table_name> CHARACTER SET utf8 COLLATE utf8_unicode_ci;

So, if you have many tables in the database, you can use the following query to dynamically generate a script which sets the character set for all required tables:

SELECT CONCAT("ALTER TABLE ",TABLE_SCHEMA,".",TABLE_NAME," CHARACTER SET utf8 COLLATE utf8_unicode_ci;") AS alter_sql
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = '<db_name>';

But this does not change existing columns/data in the table. It takes effect only for new columns added to the table.
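To see which columns still carry the old character set before and after conversion, a read-only check against information_schema can be used (replace <db_name> as appropriate):

```sql
-- Columns whose character set differs from the target utf8
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM   information_schema.COLUMNS
WHERE  TABLE_SCHEMA = '<db_name>'
AND    CHARACTER_SET_NAME IS NOT NULL
AND    CHARACTER_SET_NAME <> 'utf8';
```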

Changing character set of existing data:

The above steps set the character set at the various levels so that future objects and data will be created in UTF8.

Now, to change the character set of existing data, we need to change the character set of every text column of every table in the database.

We can use multiple approaches for converting the character set of existing data, and the Percona blog (see references) provides a very good comparison of these methods along with their advantages and disadvantages.

Here, I would like to highlight the difference between 2 methods that can be used to convert character sets.

Doing it at table level:

The following command can be used to convert the character set at table level:

ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

This takes care of converting the character set for all columns in the table. The problem with this approach is that, if you have string columns (TEXT, MEDIUMTEXT, LONGTEXT, TINYTEXT, VARCHAR, ENUM), it can end up changing the data type of those columns. An example is given in the Percona blog referenced above, where TEXT got converted to MEDIUMTEXT.

The safest and most careful way to convert the character set is to do it for each text column separately on each table.

You can use the following command to change the character set of a column (note that CHANGE requires the full column definition, including the data type):

alter table <table_name> change <column_name> <column_name> <data_type> CHARACTER SET UTF8;

Example:

alter table P1_TABLE change COL1 COL1 TEXT CHARACTER SET UTF8;

In the above command/example, we used the column name twice. That is required by the CHANGE syntax.

But wait, can the above approach convert the data correctly to the required character set?
It may not. Check the article in the references section, which describes the issues we face when we directly try to convert the character set of a column.

Sometimes converting directly can garble the data. The safest way is to convert the column to its binary equivalent first, and then convert the data type and character set of the column to the required ones. The following commands can be used:

alter table P1_TABLE change COL1 COL1 BLOB;
alter table P1_TABLE change COL1 COL1 TEXT CHARACTER SET UTF8;

Since my column had the TEXT data type, its binary equivalent is BLOB. Following are the binary equivalents of the various text data types:

  • CHAR –> BINARY
  • TEXT –> BLOB
  • TINYTEXT –> TINYBLOB
  • MEDIUMTEXT –> MEDIUMBLOB
  • LONGTEXT –> LONGBLOB
  • VARCHAR() –> VARBINARY() (Use same data length)
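For example, for a hypothetical VARCHAR(255) column COL2 on the same table (the column name and length here are assumptions for illustration), the two-step conversion would look like this:

```sql
-- Step 1: convert to the binary equivalent, keeping the same length
ALTER TABLE P1_TABLE CHANGE COL2 COL2 VARBINARY(255);

-- Step 2: convert back to VARCHAR with the target character set
ALTER TABLE P1_TABLE CHANGE COL2 COL2 VARCHAR(255) CHARACTER SET utf8;
```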

Automating character set conversion:

You can create a simple script with all the required commands using the following dynamic SQL.

Note that if CHARACTER_SET_NAME is NULL in the information_schema.COLUMNS table for a column, it means the column is numeric, binary, or of some other data type which does not need character set conversion.

The following dynamic SQL can be used to generate the script automatically:

select concat("alter table ",TABLE_NAME," change ",COLUMN_NAME," ",COLUMN_NAME," BLOB;",
"alter table ",TABLE_NAME," change ",COLUMN_NAME," ",COLUMN_NAME," ",IF(DATA_TYPE in ('varchar','char'),concat(DATA_TYPE,"(",CHARACTER_MAXIMUM_LENGTH,")"),DATA_TYPE)," CHARACTER SET utf8;")
from information_schema.columns
where TABLE_SCHEMA = '<db_name>'
and DATA_TYPE <> 'ENUM'
and CHARACTER_SET_NAME is not null;

The ENUM case is different. You need to specify all ENUM values when you convert back to the required data type and character set. Check the WordPress Codex article (see references) for more details.
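As an illustration, an ENUM column (here a hypothetical COL3 with values 'a','b','c') would be converted like this, re-listing every ENUM value on the way back:

```sql
-- Step 1: convert the ENUM column to its binary form
ALTER TABLE P1_TABLE CHANGE COL3 COL3 BLOB;

-- Step 2: convert back, re-specifying ALL enum values
ALTER TABLE P1_TABLE CHANGE COL3 COL3
    ENUM('a','b','c') CHARACTER SET utf8;
```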

Disclaimer: Please do not run the above SQLs directly on production without testing. Make sure you write your own SQL as per your needs, based on the databases and objects present in your environment.

Hope this helps !!

References:

https://codex.wordpress.org/Converting_Database_Character_Sets

https://www.percona.com/blog/2009/03/17/converting-character-sets/

http://www.bothernomore.com/2008/12/16/character-encoding-hell/

Cassandra 2.0 Architecture

In this article, I will cover the various key structures that make up Cassandra. We will also see which structures reside in memory and which reside on disk.

In the next article, I will give an overview of the key components that use these structures to run Cassandra. Further articles will cover each structure/component in more detail.

Cassandra Node Architecture:

Cassandra is cluster software, meaning it has to be installed/deployed on multiple servers which form the Cassandra cluster. In my previous article, I mentioned how to install Cassandra on a single server using the CCM tool, which simulates a Cassandra cluster on a single server.

Each server which is part of the cluster is called a node. So a node is essentially a server which runs the Cassandra software and holds some part of the data.

Cassandra distributes data across all nodes in the cluster, so every node is responsible for owning part of the data.
The node architecture of Cassandra looks like below: it forms a ring of nodes.

[Figure: Cassandra node architecture forming a ring (cassandra_architecture2)]

Structures in Cassandra

Following are the various structures of Cassandra present on each node (either in memory or on disk):

  • CommitLog
  • SSTable
  • MemTable
  • RowCache
  • KeyCache
  • SSTableIndex
  • SSTableIndexSummary
  • BloomFilter
  • Compression offset

Let's have an overview of each of these structures.

CommitLog [Disk]:

The commit log is a disk-level file which stores a log record of every transaction happening in Cassandra on that node. Each node in the cluster has its own commit log on disk. Whenever a transaction happens on a node, the commit log on disk is updated first with the changed data, followed by the MemTable in memory. This ensures durability. People familiar with Oracle terminology can think of the commit log as the online redo logs.

MemTable [Memory]:

A memtable is a dedicated in-memory structure created for each Cassandra table. It holds recently written data. Whenever data is read from a table on a node, Cassandra merges the data in the MemTable with the data on disk (in SSTables) to return the latest version. There is a separate MemTable for each Cassandra table, so there is no blocking of reads or writes across tables. Multiple updates on a single column result in multiple entries in the commit log, but a single entry in the MemTable. The MemTable is flushed to disk when predefined criteria are met, such as maximum size, timeout, or number of mutations.

SSTable [Disk]:

These are the on-disk tables. Every Cassandra table has SSTable files created on disk. An SSTable comprises 6 files on disk, all of which together represent a single SSTable.
Following are the 6 files present on disk for each SSTable:
1) Bloom Filter
2) Index
3) Data
4) Index Summary
5) Compression Info
6) Statistics

The Data file (#3 above) contains the data from the table.
All the other files are explained with their respective components below.

RowCache [Memory]:

This is an off-heap memory structure on each node which caches complete rows of a table, if that table has the row cache enabled. We can enable/disable the row cache while creating a table, or via ALTER TABLE later. Every table in Cassandra has a “caching” parameter whose valid values are:
NONE – no caching
KEYS_ONLY – key caching only
ROWS_ONLY – complete row caching only
ALL – both row and key caching
When a requested row is found in the row cache (in its latest version), Cassandra can skip all the steps needed to check and retrieve the row from the on-disk SSTable. This provides a huge performance benefit.
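As a sketch in CQL (Cassandra 2.0 syntax; the table and columns here are hypothetical), caching can be set at table creation or changed later:

```sql
-- CQL: enable both row and key caching for a table
CREATE TABLE users (
    user_id int PRIMARY KEY,
    name    text
) WITH caching = 'all';

-- Later, switch to key caching only
ALTER TABLE users WITH caching = 'keys_only';
```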

KeyCache [Memory]:

This is an on-heap memory structure on each node which contains partition keys and their offsets in the SSTables on disk. It helps reduce disk seeks while reading data from an SSTable. It is configurable at table level and can be enabled using the KEYS_ONLY or ALL setting of the caching parameter. So when a read is requested, Cassandra first checks whether the record is present in the row cache (if it is enabled for that table). If the record is not present in the row cache, it consults the bloom filter, which tells whether the data might exist in an SSTable or definitely does not exist there. Based on the result from the bloom filter, Cassandra checks for the keys in the key cache and directly gets the offset positions of those keys in the SSTable on disk.

SSTableIndex [Disk]:

The primary key index of each SSTable is stored in a separate file on disk (file #2 above). This index is used for faster lookups in the SSTable. A primary key is mandatory for a table in Cassandra so that it can uniquely identify a row. Often the primary key is the same as the partition key, based on which data is partitioned and distributed to the various nodes in the cluster.

Partition Summary [Memory and Disk]:

The partition summary is an off-heap in-memory sampling of the partition index, used to speed up access to the index on disk. The default sampling ratio is 128, meaning that for every 128 records in the index file we have 1 record in the partition summary. Each partition summary record holds a key value and its offset position in the index. So when a read request comes in for a record and it is not found in the row cache or key cache, Cassandra checks the partition summary for the offset of that key in the index file on disk. Since not all index records are stored in the summary, it gets a rough estimate of the offset it has to check in the index file. This reduces disk seeks.
The partition summary is also stored on disk in a file (file #4 above).
The partition summary looks like below:

[Figure: partition summary layout (partition_summary)]

Bloom Filter [Memory and Disk]:

A bloom filter is an off-heap, in-memory, hash-based probabilistic data structure used to test whether a specific member is part of a set. It can give false positives, but it can never give a false negative. That means a bloom filter may say that a record might be present in a table on disk when we will not actually find it, but it will never say that a record is absent when it is actually present on disk. This helps avoid unnecessary seeks for data which is not present at all.
The bloom filter is also present in a disk file (file #1 above), which contains the serialized bloom filter for the partition keys.

Compression offset maps [Memory and Disk]:

The compression offset maps hold the offset information for compressed blocks. By default all tables in Cassandra are compressed, and the greater the compression ratio, the larger the compression offset map. When Cassandra needs to look up data, it consults the in-memory compression offset maps and unpacks the relevant data chunk to get to the columns. Both writes and reads are affected, because chunks have to be compressed and uncompressed. The compression offset maps are stored as an off-heap component of memory, and are also saved on disk in a separate file for each SSTable (file #5 above).

So if we put all the above structures together and identify what is present on disk, in on-heap memory, and in off-heap memory, it looks like below:

[Figure: Cassandra structures split across disk, on-heap and off-heap memory (cass_arch)]

Hope this helps !!

Effect of object statistics on SQL execution statistics

This is a small article to demonstrate why correct statistics are important and how they affect the execution statistics of the same plan.
In the past we learned that changing table statistics or index statistics (or rebuilding an index) can cause the plan for a SQL to change, because when statistics change, the optimizer will try to generate a new plan based on the changed statistics.
With 11g, Oracle provided SQL plan baselines to ensure plan stability. So if you have a single baseline enabled for a SQL, you essentially have a single plan for that SQL, and that plan will not change unless the auto tuning job evolves another plan or you evolve another plan manually.
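To see whether a SQL has an accepted baseline (and hence a fixed plan), we can query dba_sql_plan_baselines; the filter below is just an example:

```sql
-- List baselines; ACCEPTED = YES means the plan can be used
SELECT sql_handle, plan_name, enabled, accepted
FROM   dba_sql_plan_baselines
WHERE  sql_text LIKE '%T1 a, T2 b%';
```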

Does this mean that object statistics have no role to play if your plan is fixed?

Let's have a quick demo. In the exercise below, I will not be changing the plan (as I am using a baseline), but we will see that execution statistics such as buffer gets and disk reads change as object statistics change.

I faced this issue on one of our production databases, where stats were completely inaccurate; when we gathered stats on the tables and indexes, the execution statistics of SQLs changed while the plans stayed the same.

Setup

I am performing these tests on 11.2.0.4.
Let's create 2 tables, T1 and T2. I will use one query on a single table to go for a FTS, and another query joining T1 and T2 using an index.


SQL>create table T1 as select * from dba_tables;

Table created.

SQL>insert into T1 select * from T1;

2804 rows created.

SQL>insert into T1 select * from T1;

5608 rows created.
...
...

SQL>commit;

Commit complete.

SQL>--create index on this table on TABLE_NAME column and gather stats

SQL>create index I_T1_TABLE_NAME on T1(TABLE_NAME);

Index created.

SQL>exec dbms_stats.gather_table_stats(null,'T1');

PL/SQL procedure successfully completed.

Create the 2nd table from some other view, let's say dba_tab_statistics:


DEO>create table T2 as select * from dba_tab_statistics;

Table created.

DEO>insert into T2 select * from T2;

16289 rows created.

...
...

DEO>commit;

Commit complete.

DEO>--create index on TABLE_NAME column in T2 table

DEO>create index I_T2_TABLE_NAME on T2(TABLE_NAME);

Index created.

DEO>exec dbms_stats.gather_table_stats(null,'T2');

PL/SQL procedure successfully completed.

Following are the table and index level statistics for T1, T2, and their corresponding indexes:


DEO>select table_name, num_rows, blocks, avg_row_len from dba_tables where table_name in ('T1','T2');

TABLE_NAME			           NUM_ROWS   BLOCKS     AVG_ROW_LEN
------------------------------ ---------- ---------- -----------
T1				               34783	  3214	     247
T2				               298096	  4351	     99

DEO>select INDEX_NAME, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR, NUM_ROWS from dba_indexes where table_name in ('T1','T2');

INDEX_NAME		               LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR NUM_ROWS
------------------------------ ----------- ------------- ----------------- ----------
I_T1_TABLE_NAME 		       360	       1414	         34675	           34783
I_T2_TABLE_NAME 		       1813	       2364	         74610             298096

So when object statistics are accurate and current, I see the following execution statistics for the following 2 queries:

Query 1:

select a.table_name, sum(b.num_rows) from T1 a, T2 b where a.table_name= b.table_name and a.owner = b.owner and a.table_name = :b1 group by a.table_name;


DEO>var b1 varchar2(40);
DEO>exec :b1 := 'ABC';

PL/SQL procedure successfully completed.

DEO>select a.table_name, sum(b.num_rows) from T1 a, T2 b where a.table_name= b.table_name and a.owner = b.owner and a.table_name = :b1 group by a.table_name;

TABLE_NAME		               SUM(B.NUM_ROWS)
------------------------------ ---------------
ABC				               240678640


Execution Plan
----------------------------------------------------------
Plan hash value: 1483037242

--------------------------------------------------------------------------------------------------------
| Id  | Operation			                 | Name 	       | Rows  | Bytes | Cost (%CPU)| Time	   |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		             |		           |	 1 |	61 |	 4  (50)| 00:00:01 |
|   1 |  SORT GROUP BY NOSORT		         |		           |	 1 |	61 |	 4  (50)| 00:00:01 |
|   2 |   MERGE JOIN			             |		           |	47 |  2867 |	 4  (50)| 00:00:01 |
|   3 |    SORT JOIN			             |		           |	11 |   396 |	 2  (50)| 00:00:01 |
|   4 |     VIEW			                 | VW_GBC_5	       |	11 |   396 |	 2  (50)| 00:00:01 |
|   5 |      HASH GROUP BY		             |		           |	11 |   297 |	 2  (50)| 00:00:01 |
|   6 |       TABLE ACCESS BY INDEX ROWID    | T2		       |   126 |  3402 |	 1   (0)| 00:00:01 |
|*  7 |        INDEX RANGE SCAN 	         | I_T2_TABLE_NAME |   126 |	   |	 1   (0)| 00:00:01 |
|*  8 |    SORT JOIN			             |		           |	25 |   625 |	 2  (50)| 00:00:01 |
|   9 |     TABLE ACCESS BY INDEX ROWID      | T1		       |	25 |   625 |	 1   (0)| 00:00:01 |
|* 10 |      INDEX RANGE SCAN		         | I_T1_TABLE_NAME |	25 |	   |	 1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   7 - access("B"."TABLE_NAME"=:B1)
   8 - access("A"."TABLE_NAME"="ITEM_2" AND "A"."OWNER"="ITEM_1")
       filter("A"."OWNER"="ITEM_1" AND "A"."TABLE_NAME"="ITEM_2")
  10 - access("A"."TABLE_NAME"=:B1)

Note
-----
   - SQL plan baseline "SQL_PLAN_gt8npsxnncgm2abc84fa9" used for this statement


Statistics
----------------------------------------------------------
	  1  recursive calls
	  0  db block gets
	157  consistent gets
	  3  physical reads
	  0  redo size
	598  bytes sent via SQL*Net to client
	453  bytes received via SQL*Net from client
	  2  SQL*Net roundtrips to/from client
	  2  sorts (memory)
	  0  sorts (disk)
	  1  rows processed

The above query has done 157 consistent gets and 3 physical reads.

Query 2:

select count(1) from T2 where owner = :b2;


DEO>var b2 varchar2(40);
DEO>exec :b2 := 'SYS';

PL/SQL procedure successfully completed.

DEO>select count(1) from T2 where owner = :b2;

  COUNT(1)
----------
    251088


Execution Plan
----------------------------------------------------------
Plan hash value: 3321871023

---------------------------------------------------------------------------
| Id  | Operation	       | Name | Rows  | Bytes | Cost (%CPU)| Time	  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	  |	1     |	5     |   625   (2)| 00:00:03 |
|   1 |  SORT AGGREGATE    |	  |	1     |	5     |	           |	      |
|*  2 |   TABLE ACCESS FULL| T2   | 19873 | 99365 |   625   (2)| 00:00:03 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OWNER"=:B2)

Note
-----
   - SQL plan baseline "SQL_PLAN_9wq67xmrafzda1c6cf506" used for this statement


Statistics
----------------------------------------------------------
	  1  recursive calls
	  0  db block gets
   4350  consistent gets
	  0  physical reads
	  0  redo size
	517  bytes sent via SQL*Net to client
	453  bytes received via SQL*Net from client
	  2  SQL*Net roundtrips to/from client
	  0  sorts (memory)
	  0  sorts (disk)
	  1  rows processed

The above query has done 4350 consistent gets and no physical reads.

Let's fake the statistics to random high values and see the effect. Note that both queries above are using baselines, so the plans will not change with further executions even after changing the object stats.

I am going to set the table and index stats as follows:


DEO>exec dbms_stats.SET_TABLE_STATS(null,'T1',NUMROWS=>3221020,NUMBLKS=>202590, AVGRLEN=>150)

PL/SQL procedure successfully completed.

DEO>exec dbms_stats.SET_INDEX_STATS(null,'I_T1_TABLE_NAME',NUMROWS=>3153430,NUMLBLKS=>124232,NUMDIST=>6,AVGLBLK=>2990,AVGDBLK=>19002,CLSTFCT=>76010)

PL/SQL procedure successfully completed.

DEO>exec dbms_stats.SET_TABLE_STATS(null,'T2',NUMROWS=>140000000,NUMBLKS=>13540202, AVGRLEN=>120)

PL/SQL procedure successfully completed.

DEO>exec dbms_stats.SET_INDEX_STATS(null,'I_T2_TABLE_NAME',NUMROWS=>13345304,NUMLBLKS=>1242022,NUMDIST=>12,AVGLBLK=>324,AVGDBLK=>1342,CLSTFCT=>260123)

PL/SQL procedure successfully completed.

DEO>select table_name, num_rows, blocks, avg_row_len from dba_tables where table_name in ('T1','T2');

TABLE_NAME			           NUM_ROWS   BLOCKS     AVG_ROW_LEN
------------------------------ ---------- ---------- -----------
T1				               3221020    202590	 150
T2				               140000000  13540202	 120

DEO>select INDEX_NAME, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR, NUM_ROWS from dba_indexes where table_name in ('T1','T2');

INDEX_NAME		               LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR NUM_ROWS
------------------------------ ----------- ------------- ----------------- ----------
I_T1_TABLE_NAME 		       124232	   6	         76010             3153430
I_T2_TABLE_NAME 		       1242022	   12	         260123            13345304


I will purge those queries from the shared pool and run the same queries again.


DEO>!cat purgesql.sql
accept sql_id prompt 'Enter SQL_ID:- '
DECLARE
 name varchar2(50);
BEGIN
 select distinct address||','||hash_value into name
 from v$sqlarea
 where sql_id like '&sql_id';

 sys.dbms_shared_pool.purge(name,'C',65);

END;
/

DEO>@purgesql
Enter SQL_ID:- 46j56265bmw7u

PL/SQL procedure successfully completed.

DEO>@purgesql
Enter SQL_ID:- gkbtfpmvxw4hn

PL/SQL procedure successfully completed.

Let's run the queries again after changing the stats.

Query 1:

select a.table_name, sum(b.num_rows) from T1 a, T2 b where a.table_name= b.table_name and a.owner = b.owner and a.table_name = :b1 group by a.table_name;


DEO>set autotrace on
DEO>select a.table_name, sum(b.num_rows) from T1 a, T2 b where a.table_name= b.table_name and a.owner = b.owner and a.table_name = :b1 group by a.table_name;

TABLE_NAME		               SUM(B.NUM_ROWS)
------------------------------ ---------------
ABC				               240678640


Execution Plan
----------------------------------------------------------
Plan hash value: 1483037242

----------------------------------------------------------------------------------------------------
| Id  | Operation			              | Name 	        | Rows  | Bytes | Cost (%CPU)| Time	   |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		          |		            |	 1  |	61  |	74   (9)| 00:00:01 |
|   1 |  SORT GROUP BY NOSORT		      |		            |	 1  |	61  |	74   (9)| 00:00:01 |
|   2 |   MERGE JOIN			          |		            |  4464 |   265K|	74   (9)| 00:00:01 |
|   3 |    SORT JOIN			          |		            |	11  |   396 |	71   (6)| 00:00:01 |
|   4 |     VIEW			              | VW_GBC_5	    |	11  |   396 |	71   (6)| 00:00:01 |
|   5 |      HASH GROUP BY		          |		            |	11  |   297 |	71   (6)| 00:00:01 |
|   6 |       TABLE ACCESS BY INDEX ROWID | T2		        | 59222 |  1561K|	67   (0)| 00:00:01 |
|*  7 |        INDEX RANGE SCAN 	      | I_T2_TABLE_NAME | 59222 |	    |	55   (0)| 00:00:01 |
|*  8 |    SORT JOIN			          |		            |  2278 | 56950 |	 3  (67)| 00:00:01 |
|   9 |     TABLE ACCESS BY INDEX ROWID   | T1		        |  2278 | 56950 |	 1   (0)| 00:00:01 |
|* 10 |      INDEX RANGE SCAN		      | I_T1_TABLE_NAME |  2278 |	    |	 1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   7 - access("B"."TABLE_NAME"=:B1)
   8 - access("A"."TABLE_NAME"="ITEM_2" AND "A"."OWNER"="ITEM_1")
       filter("A"."OWNER"="ITEM_1" AND "A"."TABLE_NAME"="ITEM_2")
  10 - access("A"."TABLE_NAME"=:B1)

Note
-----
   - SQL plan baseline "SQL_PLAN_gt8npsxnncgm2abc84fa9" used for this statement


Statistics
----------------------------------------------------------
	 26  recursive calls
	 62  db block gets
	231  consistent gets
	  4  physical reads
  14528  redo size
	598  bytes sent via SQL*Net to client
	453  bytes received via SQL*Net from client
	  2  SQL*Net roundtrips to/from client
	  2  sorts (memory)
	  0  sorts (disk)
	  1  rows processed

Note that the plan didn't change, but we see that consistent gets (buffer_gets/exec) increased from the previous value of 157 to 231.
Physical reads are more or less the same (an increase of just 1).

Let's check the 2nd query doing a FTS.

Query 2:

select count(1) from T2 where owner = :b2;


DEO>set  autotrace on
DEO>select count(1) from T2 where owner = :b2;

  COUNT(1)
----------
      9600

Execution Plan
----------------------------------------------------------
Plan hash value: 3321871023

---------------------------------------------------------------------------
| Id  | Operation	       | Name | Rows  | Bytes | Cost (%CPU)| Time	  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	  |	1     |	5     |  1926K  (1)| 02:10:54 |
|   1 |  SORT AGGREGATE    |	  |	1     |	5     |	           |	      |
|*  2 |   TABLE ACCESS FULL| T2   |  9333K|    44M|  1926K  (1)| 02:10:54 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OWNER"=:B2)

Note
-----
   - SQL plan baseline "SQL_PLAN_9wq67xmrafzda1c6cf506" used for this statement


Statistics
----------------------------------------------------------
	  1  recursive calls
	  0  db block gets
   8195  consistent gets
   4342  physical reads
	  0  redo size
	515  bytes sent via SQL*Net to client
	453  bytes received via SQL*Net from client
	  2  SQL*Net roundtrips to/from client
	  0  sorts (memory)
	  0  sorts (disk)
	  1  rows processed


A huge increase in consistent gets compared to the previous run: it showed 4350 consistent gets when stats were accurate, but with the inflated stats, consistent gets became 8195.
Also, there were no disk reads previously, perhaps because all blocks were in the buffer cache; with the changed object-level stats, we are seeing 4342 disk reads.

So always make sure you have the latest accurate statistics for the tables and indexes in your database. You may have the right plans, but bad object stats can cause Oracle to do more work.

Hope this helps !!

Tracing Single SQL in Oracle

Many times, while doing SQL tuning, we want to trace (event 10046) a single SQL in the database. Instead of going for module-level tracing or session-level tracing using DBMS_MONITOR, we can simply use the ALTER SYSTEM command below to trace a specific SQL.

Example: I have a table T1 and an index T_I_TABLE_NAME on that table.
I am running the following SQL and want to trace it:

select * from T1 where table_name = 'SINGLE_PRODUCT_GROUPS';

First I find the SQL ID of the above SQL:

select sql_id, sql_text from v$sql where sql_text like '%SINGLE_PRODUCT_GROUPS%'

SQL_ID	      SQL_TEXT
------------- --------------------------------------------------------------------------------
8kybysnu4nn34 select * from T1 where table_name = 'SINGLE_PRODUCT_GROUPS'

Once I have the SQL ID, I can use the following ALTER SYSTEM to trace this SQL:


alter system set events 'sql_trace[SQL:8kybysnu4nn34] plan_stat=all_executions,wait=true,bind=true';

Note that even though this is ALTER SYSTEM, other SQL IDs are not affected by the tracing, because we are tracing only this specific SQL ID. So unless SQL ID 8kybysnu4nn34 is run, no trace file will be generated.

Once the SQL has run and the trace is generated, you can turn tracing off using the following statement:


alter system set events 'sql_trace[SQL:8kybysnu4nn34] off';

This might generate multiple trace files, as multiple sessions might run the same SQL ID (depending on the application).
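To locate the resulting trace files, the diagnostic trace directory can be queried from the database; a minimal sketch (file names will vary by instance and process):

```sql
-- Directory where 11g+ writes trace files
SELECT value FROM v$diag_info WHERE name = 'Diag Trace';

-- Then on the OS, list the most recent trace files, e.g.:
--   ls -lt <diag_trace_dir>/*.trc | head
```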

Hope this helps !!!

direct path read behavior in Oracle 11.2

Prior to 11g, whenever the optimizer went for a full table scan, Oracle showed the “db file scattered read” wait event. But starting from 11g, a full table scan can show (depending on certain conditions) the “direct path read” wait event.

db file scattered read – happens when blocks for a table are read from the datafile into the buffer cache in the SGA

direct path read – happens when blocks for a table are read from the datafile directly into the PGA

The problem with fetching data into the db cache was that, if the table is big enough, it occupies many blocks in the db cache, causing performance issues. And if multiple SQLs are doing an FTS (full table scan), every session will try to access those blocks in the buffer cache by acquiring the latch, which can cause “latch: cache buffers chains” contention. This spikes CPU on the host.

With direct path read, blocks read from the datafile go to the PGA, and every session has its own chunk of memory allocated in the PGA. So multiple sessions doing an FTS on the same table do not make the situation as bad as it used to be with “db file scattered read”.

There are certain conditions based on which Oracle decides whether to go for direct path read or use the buffer cache.
In the reference section, I have listed several good blog articles which discuss one or more of these behaviors.

_small_table_threshold

_small_table_threshold : This is a hidden parameter which helps the server process decide whether to go for direct path read or the buffer cache. The unit of this parameter is number of blocks, and the default value is 2% of the buffer cache size.

DEO>@param
Enter Parameter name: _small_table

Parameter			                  Description						                           Session Va Instance V
------------------------------------- ------------------------------------------------------------ ---------- ----------
_small_table_threshold		          lower threshold level of table size for direct reads	       27107      27107

In my case this is set to 27107 blocks, which is approximately 2% of my db_cache_size (you need to convert db_cache_size into Oracle blocks and take 2% of that value to arrive at the default).
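As a sketch, the hidden parameter can be inspected (as SYS) via the X$ fixed tables, and the 2% default can be approximated from the cache and block size parameters. Note that under automatic memory management db_cache_size may report 0, so treat the second query as illustrative:

```sql
-- Current value of the hidden parameter (requires SYS)
SELECT i.ksppinm name, v.ksppstvl value
  FROM x$ksppi i, x$ksppsv v
 WHERE i.indx = v.indx
   AND i.ksppinm = '_small_table_threshold';

-- Approximate default: 2% of the buffer cache expressed in blocks
SELECT ROUND(0.02 * cache.value / blk.value) approx_default_blocks
  FROM (SELECT value FROM v$parameter WHERE name = 'db_cache_size') cache,
       (SELECT value FROM v$parameter WHERE name = 'db_block_size') blk;
```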

If the size of the table (blocks below the high water mark) is > _small_table_threshold, then Oracle chooses to go for direct path read.
Many blog articles state that direct path read kicks in only when the table size is > 5 times _small_table_threshold.

During my testing, I was not able to verify the 5 times _small_table_threshold claim, because as soon as the table size was more than _small_table_threshold, it went for direct path read. I have checked this behavior in 11.2.0.3 as well as in 11.2.0.4.

Let’s test the dependency of direct path read on this hidden parameter for partitioned as well as non-partitioned tables

Non-Partitioned Table

My Oracle version is 11.2.0.4. I have a non-partitioned table T1 created from DBA_OBJECTS. You can create any test table of significant size using any view in the database or by using a “CONNECT BY rownum<=” clause against dual. Gather stats on the table.
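A minimal sketch of the setup (row and block counts will differ on your instance):

```sql
-- Build a test table from DBA_OBJECTS
CREATE TABLE T1 AS SELECT * FROM dba_objects;

-- Optionally grow it until it is comfortably larger than _small_table_threshold
INSERT INTO T1 SELECT * FROM T1;
COMMIT;

-- Gather stats so the block counts used in the direct read decision are accurate
EXEC dbms_stats.gather_table_stats(user, 'T1');
```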


DEO>select table_name, num_rows, blocks from dba_tables where table_name = 'T1';

TABLE_NAME			           NUM_ROWS   BLOCKS
------------------------------ ---------- ----------
T1				               91200	    3214

My table has 3214 blocks.

Setting _small_table_threshold to 5000


DEO>alter session set "_small_table_threshold" = 5000;

Session altered.

Since the table size is < _small_table_threshold, it should not go for direct path read. Instead we should see blocks going into the db cache.
We can check which objects’ blocks are placed in the buffer cache using the X$KCBOQH table. So when a full table scan reads into the db cache, we should be able to see it using this table.

Let’s flush the buffer cache to make sure nothing is present in it.


DEO>alter system flush buffer_cache;

System altered.

DEO>select count(*) from T1;

  COUNT(*)
----------
     91200

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93840	   3209

As shown above, the FTS has read all the blocks into the db cache.

Now, let’s reduce _small_table_threshold to 3000 and flush the buffer cache again.

DEO>alter session set "_small_table_threshold"=3000;

Session altered.

DEO>alter system flush buffer_cache;

System altered.

Since table size > _small_table_threshold, we should see direct path read.
We can verify whether a session has done direct path reads using the statistic “physical reads direct” from v$sesstat.
The following query shows the same. In this case my SID is 6531.
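If you want to reproduce this, your own SID and the statistic number can be looked up as below; statistic# 97 is specific to this version, so resolve it by name rather than hard-coding it:

```sql
-- Current session's SID
SELECT sid FROM v$mystat WHERE rownum = 1;

-- Statistic number for 'physical reads direct' on this version
SELECT statistic#, name FROM v$statname WHERE name = 'physical reads direct';
```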

DEO>r
  1* select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    2

So the current value of the direct path read statistic is 2 before reading from the table. Let’s read data from the table


DEO>select count(*) from T1;

  COUNT(*)
----------
     91200

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    3210

We see that once the table size is more than _small_table_threshold, it has gone for direct path read.
We can also confirm whether the db cache has anything related to our table T1. Remember that we flushed the buffer cache before running the last query on the T1 table.

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93840	      1

We can see 1 block in the db cache, which is the header block. The header block is always read into the buffer cache.

Partitioned Table

In the case of a partitioned table, the same behavior was observed, but _small_table_threshold is compared with the size of the partitions we are selecting instead of the entire table.

I have created a similar partitioned table with 4 partitions


CREATE TABLE T1_PART
(
    col1 NUMBER,
    col2 VARCHAR2(100)
 )
 PARTITION BY LIST (col1)
 (
    PARTITION PART_1  VALUES (1)
  , PARTITION PART_2  VALUES (2)
  , PARTITION PART_3  VALUES (3)
  , PARTITION PART_4  VALUES (4)
 ) ;

INSERT INTO T1_PART (SELECT mod(rownum,4)+1, rpad('X',100,'X') FROM dual CONNECT BY rownum<=896000);

DEO>select table_name, partition_name, num_rows, blocks, avg_row_len from user_tab_partitions where table_name='T1_PART';

TABLE_NAME		       PARTITION_NAME			              NUM_ROWS   BLOCKS     AVG_ROW_LEN
-------------------    -------------------------------------- ---------- ---------- -----------
T1_PART 		       PART_1				                  128000      1948	    104
T1_PART 		       PART_2				                  128000      1948	    104
T1_PART 		       PART_3				                  128000      1948	    104
T1_PART 		       PART_4				                  128000      1948	    104

4 rows selected.

Each partition is 1948 blocks.
Let’s change _small_table_threshold to 1800 and select values from a single partition. Since the size of the partition we are selecting > _small_table_threshold, it should go for direct path read


DEO>alter session set "_small_table_threshold"=1800;

Session altered.

DEO>select count(1) from T1_PART where col1 = 1;

  COUNT(1)
----------
    128000

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME = 'PART_1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	      1

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    5166

We are seeing the “physical reads direct” statistic value as 5166. This is a cumulative value; the previous value was 3210.
So it has read 5166 – 3210 = 1956 blocks, which is approximately the same as the number of blocks for 1 partition.

Also, we see 1 block read into the buffer cache, which is the header block

To verify that _small_table_threshold is checked against the partitions we are selecting and not at table level, let’s increase _small_table_threshold to 3600 and again select a single partition


DEO>alter session set "_small_table_threshold"=3600;

Session altered.

DEO>select count(1) from T1_PART where col1 = 1;

  COUNT(1)
----------
    128000

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    5168

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME = 'PART_1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1949

So there is no increase in “physical reads direct”, but we can see the blocks were read into the buffer cache, as SUM(NUM_BUF) now matches the number of blocks for that partition.

If we select 2 partitions, it will again go for direct path read, because the sum of blocks of the 2 partitions > _small_table_threshold (3600)


DEO>select count(1) from T1_PART where col1 in (1,2);

  COUNT(1)
----------
    256000

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    9064

Both partitions went for direct path read even though 1 partition was in the buffer cache.
The above stat value is cumulative. It has increased by (9064 – 5168) 3896 blocks, which is the size of 2 partitions.

We can change this behavior using the _direct_read_decision_statistics_driven parameter, discussed later in this article.

What happens when the blocks are cached in buffer cache ?

If we have at least 50% of the blocks cached in the buffer cache, Oracle will choose to use the db cache instead of direct path read for an FTS. Alex Fatkulin has published a very nice article http://afatkulin.blogspot.ie/2009/01/11g-adaptive-direct-path-reads-what-is.html about this behavior, and he has provided a function to test it.

But I am seeing different behavior in my instance. In my tests, Oracle does not go for the db cache unless about 98% of the blocks are in the buffer cache.

In my case, for table T1, we can select half the records from the table to get approximately 50% of the blocks cached in the buffer cache and check the behavior.

Non-Partitioned Table

Let’s increase _small_table_threshold to 5000 so that Oracle does not go for direct path read, and read approximately half the blocks into the buffer cache. We are also flushing the buffer cache to make sure nothing is present.


DEO>alter system flush buffer_cache;

System altered.

DEO>alter session set "_small_table_threshold"=5000;

Session altered.

DEO>select count(1) from T1 where rownum < 50000;

  COUNT(1)
----------
     49999

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93840	   1753

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                     5168

From the above output we can see that more than 50% of the blocks are cached in the buffer cache.
Now, let’s change _small_table_threshold back to 3000 to check whether the FTS goes for direct path read or the buffer cache.


DEO>alter session set "_small_table_threshold"=3000;

Session altered.

DEO>select count(1) from T1;

  COUNT(1)
----------
     91200

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93840	   1753

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct					                        8376

So even when we had more than 50% of the blocks in the buffer cache, it still goes for direct path read. SUM(NUM_BUF) did not change from 1753 to 3209 (all the blocks of the T1 table).
I increased the number of blocks cached in the buffer cache to 90% and it still went for direct path read.

Only when approximately 98% of the blocks are cached does Oracle go for the db cache. I validated the same behavior in 11.2.0.3 as well.

Partitioned Table

I repeated the same process for a partitioned table and found the same behavior. In this case Oracle checks each partition instead of the table.

Each partition has 1948 blocks. So let’s put 90% of the blocks in the db cache and check the behavior.
Let’s increase _small_table_threshold to 3000 deliberately so that Oracle uses the db cache.


DEO>alter session set "_small_table_threshold"=3000;

Session altered.

DEO>alter system flush buffer_cache;

System altered.

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME = 'PART_1') group by obj#;

no rows selected

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    2

DEO>--selecting 90% of records from PART_1
DEO>select count(*) from T1_PART where col1 = 1 and rownum < (128000*0.9);

  COUNT(*)
----------
    115199

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    2

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME = 'PART_1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1753

We have 90% of the blocks of partition PART_1 in the buffer cache.
Now we can reduce _small_table_threshold to 1800 and try selecting.


DEO>alter session set "_small_table_threshold"=1800;

Session altered.

DEO>--selecting complete PART_1
DEO>select count(*) from T1_PART where col1 = 1;

  COUNT(*)
----------
    128000

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME = 'PART_1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1753

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    1950

DEO>

As seen above, even with 90% of the blocks in the db cache, it goes for direct path read.
Let’s repeat the same with 98% of the blocks cached


DEO>alter session set "_small_table_threshold"=3000;

Session altered.

DEO>alter system flush buffer_cache;

System altered.

DEO>select count(*) from T1_PART where col1 = 1 and rownum < (128000*0.98);

  COUNT(*)
----------
    125439

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME = 'PART_1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1910

DEO>alter session set "_small_table_threshold"=1800;

Session altered.

DEO>select count(*) from T1_PART where col1 = 1;

  COUNT(*)
----------
    128000

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# = (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME = 'PART_1') group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1949

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                     1950

So with 98% of the rows/blocks in the db cache, Oracle does NOT go for direct path read.

What if we have 1 partition in the cache and we select 2 partitions? Will it go for direct path read?

I have kept _small_table_threshold=3000, which is more than the size of 1 partition but less than the size of 2 partitions


DEO>alter session set "_small_table_threshold"=3000;

Session altered.

DEO>select count(*) from T1_PART where col1 = 1;

  COUNT(*)
----------
    128000

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# in (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME in ('PART_1','PART_2')) group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1949

We have PART_1 above in the db cache and nothing for PART_2. Let’s select 2 partitions


DEO>select count(*) from T1_PART where col1 in (1,2);

  COUNT(*)
----------
    256000

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# in (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME in ('PART_1','PART_2')) group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1949
     93846	      1

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    5848

So even when 1 partition was in the db cache, if we select 2 partitions, it goes for direct path read for both partitions.
The above stats value of 5848 is cumulative from the previous value of 1950. The difference comes out to 3898 blocks, which is the size in blocks of 2 partitions.

_direct_read_decision_statistics_driven

This parameter mainly affects the behavior of partitioned tables.

We have seen above that if we have a partition cached in the db cache and we select 2 partitions, both partitions go for direct path read. This happens when _direct_read_decision_statistics_driven is set to true (the default).

When the _direct_read_decision_statistics_driven parameter is true, Oracle uses table statistics to decide whether to go for direct path read or the db cache (using db file scattered read).
When it is false, Oracle uses the segment header block (the one which always gets read into the db cache) to decide whether to go for direct path read or the db cache. It means that the value of _small_table_threshold is compared with every segment (partition) individually, and not with the total blocks of the partitions we are selecting.

In all the above exercises _direct_read_decision_statistics_driven was set to true, so the Oracle server process used table stats to decide the fetch method.

Let’s set _direct_read_decision_statistics_driven to false to check the behavior.


DEO>alter session set "_direct_read_decision_statistics_driven"=false;

Session altered.

We have 1 partition cached entirely in db cache.


DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# in (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME in ('PART_1','PART_2')) group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1949
     93846	      1

The current value of the physical reads direct statistic is as below, and we will set _small_table_threshold to 1800, which is less than the size of a single partition


DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    3896

DEO>alter session set "_small_table_threshold" = 1800;

Session altered.

Let’s select 2 partitions


DEO>select count(1) from T1_PART where col1 in (1,2);

  COUNT(1)
----------
    256000

1 row selected.

DEO>select a.sid, b.name, a.value from v$sesstat a, v$statname b where a.statistic# = b.statistic# and a.statistic# in (97) and a.sid = 6531;

       SID NAME 								                            VALUE
---------- ---------------------------------------------------------------- ----------
      6531 physical reads direct						                    5846

1 row selected.

DEO>select obj#, sum(num_buf) from X$KCBOQH where obj# in (select object_id from dba_objects where object_name = 'T1_PART' and SUBOBJECT_NAME in ('PART_1','PART_2')) group by obj#;

      OBJ# SUM(NUM_BUF)
---------- ------------
     93845	   1949
     93846	      1

2 rows selected.

If we check carefully, direct path reads were done for only 1 partition (5846 – 3896 = 1950). 1950 is the size of a single partition.
So Oracle has gone for direct path read for the 2nd partition and has used the blocks cached in the buffer cache for the 1st partition.
Since we have set _direct_read_decision_statistics_driven to false, Oracle reads the header block of each individual segment to decide the fetch method to be used for that segment.
Since the 1st segment, PART_1, was present entirely in the db cache, it decided to use that. PART_2 was not present in the db cache (only its header block was), so its size was compared with _small_table_threshold, and since PART_2 size > _small_table_threshold, it went with direct path read.

In the earlier exercise (_direct_read_decision_statistics_driven = true), when we set _small_table_threshold to 3600 for the partitioned table and selected 2 partitions, both went for direct path read. If we repeat that now, both partitions will go to the db cache, because the size of each individual partition (from the header block of each segment) is compared individually, and not the sum of the blocks we are selecting.

_serial_direct_read

If you want to disable direct path reads for FTS, you can make use of the _serial_direct_read parameter. Many DBAs increase the value of _small_table_threshold to a very high value (higher than any table in the database) to prevent direct path reads. But Oracle has provided the hidden parameter _serial_direct_read, which if set to false will disable direct path reads.

The default value of this parameter is true, which means direct path reads are allowed.
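A sketch of toggling it at session level (on some 11.2 patch levels this parameter also accepts ALWAYS/AUTO/NEVER; check your exact version before relying on it):

```sql
-- Disable serial direct path reads for this session
ALTER SESSION SET "_serial_direct_read" = false;

-- Restore the default
ALTER SESSION SET "_serial_direct_read" = true;
```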

You can also set event 10949 to disable direct path reads. Setting this event is equivalent to setting _serial_direct_read=false.

alter session set events '10949 trace name context forever, level 1';

Hope this helps !!

References

https://dioncho.wordpress.com/2009/07/21/disabling-direct-path-read-for-the-serial-full-table-scan-11g/
http://afatkulin.blogspot.ie/2009/01/11g-adaptive-direct-path-reads-what-is.html
http://blog.tanelpoder.com/2012/09/03/optimizer-statistics-driven-direct-path-read-decision-for-full-table-scans-_direct_read_decision_statistics_driven/
http://oracle-tech.blogspot.com.tr/2014/04/smalltablethreshold-and-direct-path.html
http://oracle-tech.blogspot.com.tr/2014/04/directreaddecisionstatistcsdriven.html

Pointing MySQL Read-Replica to new master after failover

Here I am describing a simple scenario where we have a read replica (in addition to a slave) which is currently pointing to the master MySQL instance.
We have another slave which is also pointing to the same master. What we are going to see is how to point the read replica to the new master after a failover.

Environment:

To familiarize you with the environment, we have 3 instances of MySQL

mysql_a -> Current master
mysql_b -> Current slave
mysql_c -> read replica pointing to mysql_a

Technically a read replica is the same as a slave, but in this context we call it a read replica to indicate that it is an additional slave which does not take part in the failover scenario.
So the exercise is to fail over the master from mysql_a to mysql_b and point the read replica mysql_c to mysql_b

I am using GTIDs on all 3 instances, and both slaves are in sync with the master.
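GTID-based replication can be confirmed on each instance with:

```sql
-- Should show gtid_mode = ON on master, slave and read replica
SHOW VARIABLES LIKE 'gtid_mode';
SHOW VARIABLES LIKE 'enforce_gtid_consistency';
```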

On Master (mysql_a):

root [mysql] >show master status \G
*************************** 1. row ***************************
             File: bin_log.000007
         Position: 612
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:1-91
1 row in set (0.00 sec)

On Slave (mysql_b):

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
               ...
               ...
              Master_Log_File: bin_log.000007
          Read_Master_Log_Pos: 612
               Relay_Log_File: relay_log.000010
                Relay_Log_Pos: 818
        Relay_Master_Log_File: bin_log.000007
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
            ...
            ...
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:1-91
            Executed_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:1-91
                Auto_Position: 1

On read-replica (mysql_c):

root [mysql] >show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
               ...
               ...
              Master_Log_File: bin_log.000007
          Read_Master_Log_Pos: 612
               Relay_Log_File: relay_log.000005
                Relay_Log_Pos: 442
        Relay_Master_Log_File: bin_log.000007
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
            ...
            ...
           Retrieved_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:91
            Executed_Gtid_Set: 6e68fbf2-f9a0-11e4-a3f7-90e2ba6ebdb0:1-7,
f7718b80-c237-11e4-baa8-a0369f370a52:1-91
                Auto_Position: 1

Lagging read-replica:

Before I fail over, I want to make the read replica lag by a couple of transactions, so that when we recover and point the replica to the new master we can clearly see transactions getting applied from both the old and the new master.
I will stop the slave IO thread on the read replica (mysql_c) so that it stops reading the binlog from the master (mysql_a)

root [mysql] >stop slave IO_thread;
Query OK, 0 rows affected (0.03 sec)

root [mysql] >show slave status \G
*************************** 1. row ***************************
               Slave_IO_State:
               ...
               ...
             Slave_IO_Running: No
            Slave_SQL_Running: Yes

So any change to the master will not be replicated to the read replica (mysql_c)

Now, let’s change a few records on the master and move its GTID forward

On Master (mysql_a):

root [deo] >show master status \G
*************************** 1. row ***************************
             File: bin_log.000007
         Position: 1220
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:1-93
1 row in set (0.00 sec)

I did a couple of transactions on the master (mysql_a) and increased the GTID set from 1-91 to 1-93.

I see that those transactions are already applied to the slave (mysql_b)

On Slave (mysql_b):

           Retrieved_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:1-93
            Executed_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:1-93
                Auto_Position: 1
1 row in set (0.00 sec)

But we didn’t get those transactions on the read replica (mysql_c) as its IO thread is down

On read replica (mysql_c):

           Retrieved_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:91
            Executed_Gtid_Set: 6e68fbf2-f9a0-11e4-a3f7-90e2ba6ebdb0:1-7,
f7718b80-c237-11e4-baa8-a0369f370a52:1-91
                Auto_Position: 1

Failover:

Now, we will fail over the master to the slave (mysql_b)

Failover is a simple process. We just have to stop the current master and make sure the slave is not read only. Plus, any other custom configuration that you have done on the master, you need to do the same on the slave.
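A minimal sketch of the failover itself (how you stop the master is up to you: mysqladmin, service scripts, etc.):

```sql
-- On the old master (mysql_a), stop the instance from the shell, e.g.:
--   mysqladmin -u root -p shutdown

-- On the new master (mysql_b), make sure it accepts writes
SET GLOBAL read_only = 0;
```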

So after failing over to mysql_b, my new master looks like below

mysql> show master status \G
*************************** 1. row ***************************
             File: bin_log.000003
         Position: 2495
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: f7718b80-c237-11e4-baa8-a0369f370a52:1-93
1 row in set (0.00 sec)

We could also do a reset master, which would flush the binlog files and clear the executed GTID set. But let’s not do that, as that is not the objective of our exercise.

Pointing read replica to new master:

Now, we need to point our read replica (which is still pointing to the old master (mysql_a)) to the new master (mysql_b).
To do that we need to recover the transactions that were missed from the old master and continue receiving transactions from the new master.

When we do new transactions on the new master, they get GTIDs under the new master’s server UUID.
For example, I did 1 transaction after the failover on the new master, and my new master status looks like below

mysql> show master status \G
*************************** 1. row ***************************
             File: bin_log.000003
         Position: 2802
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: c21f28d2-c243-11e4-baf5-2c600c20dba4:1,
f7718b80-c237-11e4-baa8-a0369f370a52:1-93
1 row in set (0.00 sec)

c21f28d2-c243-11e4-baf5-2c600c20dba4:1 is the new GTID

If we just start the slave on the read replica, it will not be able to fetch any transactions, as the old master is down. So we need to first point the read replica to the new master and then start the slave, which will fetch all the old and new transactions

root [mysql] >stop slave;
Query OK, 0 rows affected (0.03 sec)

root [mysql] >change master to master_host='mysql_b.example.com', master_port=3306, master_user='replicate', MASTER_PASSWORD = 'welcome', MASTER_AUTO_POSITION = 1;
Query OK, 0 rows affected, 2 warnings (0.07 sec)

root [mysql] >start slave;
Query OK, 0 rows affected (0.03 sec)

root [mysql] >show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: mysql_b.example.com
                  ...
                  ...
           Retrieved_Gtid_Set: c21f28d2-c243-11e4-baf5-2c600c20dba4:1,
f7718b80-c237-11e4-baa8-a0369f370a52:92-93
            Executed_Gtid_Set: 6e68fbf2-f9a0-11e4-a3f7-90e2ba6ebdb0:1-7,
c21f28d2-c243-11e4-baf5-2c600c20dba4:1,
f7718b80-c237-11e4-baa8-a0369f370a52:1-93
                Auto_Position: 1

In the above slave status output, we can check Retrieved_Gtid_Set, and we can clearly see it has retrieved the previous master’s 2 transactions, f7718b80-c237-11e4-baa8-a0369f370a52:92-93, as well as the new transaction that I did after the failover – c21f28d2-c243-11e4-baf5-2c600c20dba4:1
In Executed_Gtid_Set we can see it has executed all these transactions – c21f28d2-c243-11e4-baf5-2c600c20dba4:1 and f7718b80-c237-11e4-baa8-a0369f370a52:1-93

Hope this helps !!

Apache Cassandra – NoSQL storage solution

These days I am exploring another storage solution – Cassandra.

The Apache Cassandra datastore was originally developed at Facebook and later released as open source; its distributed design is largely based on Amazon's Dynamo paper. Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients.

DataStax has created an enterprise edition of Cassandra built on Apache Cassandra, so today we have multiple flavors of Cassandra available from both Apache and DataStax.

Cassandra is a NoSQL database storage solution; at its core it stores data as rows of key-value pairs grouped by a partition key. Along with the enterprise software, DataStax also provides extensive documentation for learning Cassandra, as well as self-paced and instructor-led training.

I have started learning Cassandra using the self-paced training available at the following location – https://academy.datastax.com/courses

Apart from that, DataStax also runs a very active blog discussing issues and features in Cassandra – http://www.datastax.com/dev/blog/

Installation:

You can either do a full installation of Cassandra on multiple physical nodes to create a cluster, or simulate a cluster on a single node using CCM (Cassandra Cluster Manager).

Installing Cassandra on multiple physical nodes might not be feasible for everyone, which is why CCM is a great utility for learning Cassandra.

You can find instructions to install CCM at the following location – http://www.datastax.com/dev/blog/ccm-a-development-tool-for-creating-local-cassandra-clusters

Valid Versions:

At the time of this writing, the most stable version of Apache Cassandra is 2.1.14, while the latest release is 2.1.15. Older versions like 2.0.9 were also stable.

You can get the complete list of Apache Cassandra releases at – http://archive.apache.org/dist/cassandra/

You can check the DataStax community versions of Cassandra at http://planetcassandra.org/cassandra/

The community version is meant for learning and is free to download, install, and play around with.

CCM Installation Issue:

I faced the following issue when I installed CCM on my Ubuntu 12.04 machine.

ccm create --version=2.0.9 --nodes=6 deo
Downloading http://archive.apache.org/dist/cassandra/2.0.9/apache-cassandra-2.0.9-src.tar.gz to /tmp/ccm-2oKzAH.tar.gz (10.810MB)
  11335077  [100.00%]
Extracting /tmp/ccm-2oKzAH.tar.gz as version 2.0.9 ...
Compiling Cassandra 2.0.9 ...
Deleted /home/local/advaitd/.ccm/repository/2.0.9 due to error
Traceback (most recent call last):
  File "/usr/local/bin/ccm", line 5, in <module>
    pkg_resources.run_script('ccm==2.0.3.1', 'ccm')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/ccm-2.0.3.1-py2.7.egg/EGG-INFO/scripts/ccm", line 72, in <module>
    cmd.run()
  File "/usr/local/lib/python2.7/dist-packages/ccm-2.0.3.1-py2.7.egg/ccmlib/cmds/cluster_cmds.py", line 127, in run
    cluster = Cluster(self.path, self.name, install_dir=self.options.install_dir, version=self.options.version, verbose=True)
  File "/usr/local/lib/python2.7/dist-packages/ccm-2.0.3.1-py2.7.egg/ccmlib/cluster.py", line 51, in __init__
    dir, v = self.load_from_repository(version, verbose)
  File "/usr/local/lib/python2.7/dist-packages/ccm-2.0.3.1-py2.7.egg/ccmlib/cluster.py", line 64, in load_from_repository
    return repository.setup(version, verbose)
  File "/usr/local/lib/python2.7/dist-packages/ccm-2.0.3.1-py2.7.egg/ccmlib/repository.py", line 40, in setup
    download_version(version, verbose=verbose, binary=binary)
  File "/usr/local/lib/python2.7/dist-packages/ccm-2.0.3.1-py2.7.egg/ccmlib/repository.py", line 221, in download_version
    raise e
ccmlib.common.CCMError: Error compiling Cassandra. See /home/local/advaitd/.ccm/repository/last.log for details
I posted the same error on the GitHub project and immediately got a solution – https://github.com/pcmanus/ccm/issues/268

The suggestion was to download the binary version of Cassandra using -v binary:2.0.9. Cluster creation was successful after switching to the binary version.
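For reference, the working sequence looked roughly like the following sketch (the cluster name deo and the six-node count are from my setup; adjust them for yours):

```shell
# Create a 6-node cluster named "deo" from the pre-built 2.0.9 binary.
# The binary: prefix makes CCM download a compiled tarball instead of
# compiling from source, which is what was failing above.
ccm create deo -v binary:2.0.9 -n 6

# Start all nodes in the cluster
ccm start

# Verify that every node came up
ccm status
```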

You can create a cluster with as many nodes as you want. All CCM does is create that many directories and treat each one as a separate node.

I created a six-node cluster on my Ubuntu machine.

advaitd@desktop:~$ ccm status
Cluster: 'deo'
--------------
node1: UP
node3: UP
node2: UP
node5: UP
node4: UP
node6: UP

CCM Installation details:

CCM creates a hidden directory (.ccm) under your home directory, with a separate installation directory for each node underneath it, as shown below.

advaitd@desktop:~$ pwd
/home/local/advaitd
advaitd@desktop:~$ ls -rlt .ccm
total 12
drwxr-xr-x 3 advaitd domain^users 4096 Apr 23 12:15 repository
-rw-r--r-- 1 advaitd domain^users    4 Apr 23 12:15 CURRENT
drwxr-xr-x 8 advaitd domain^users 4096 Apr 23 12:15 deo
advaitd@desktop:~$ ls -rlt .ccm/deo/
total 28
-rw-r--r-- 1 advaitd domain^users  291 Apr 23 12:15 cluster.conf
drwxr-xr-x 8 advaitd domain^users 4096 Apr 23 12:16 node2
drwxr-xr-x 8 advaitd domain^users 4096 Apr 23 12:16 node1
drwxr-xr-x 8 advaitd domain^users 4096 Apr 23 12:16 node5
drwxr-xr-x 8 advaitd domain^users 4096 Apr 23 12:16 node3
drwxr-xr-x 8 advaitd domain^users 4096 Apr 23 12:16 node6
drwxr-xr-x 8 advaitd domain^users 4096 Apr 23 12:16 node4
advaitd@desktop:~$

So each of the above node directories has its own Cassandra installation. Each directory is treated as a separate node, and together they form the cluster.

A Cassandra binary runs from each node directory, so we should see six Cassandra processes running on the host, as shown below.

advaitd@desktop:~$ ps -aef | grep cassandra | grep -v grep | wc -l
6
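Once the cluster is up, CCM also lets you interact with individual nodes by name (the names are the ones shown by ccm status). A couple of handy commands, as a sketch:

```shell
# Open a CQL shell against node1 of the current cluster
ccm node1 cqlsh

# Show the token ring and data ownership from node1's perspective
# (wraps nodetool ring for that node)
ccm node1 ring
```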

I will keep learning and posting more articles on Cassandra as well.

References:

http://cassandra.apache.org/

http://www.datastax.com/

http://en.wikipedia.org/wiki/Apache_Cassandra

https://github.com/pcmanus/ccm/issues

http://www.datastax.com/blog

http://docs.datastax.com/en/index.html