Oracle RAC: Cluster reconfiguration steps

Oracle RAC's cluster reconfiguration process entails seven steps. Learn what they are in this excerpt from "Oracle Database 10g: Real Application Clusters Handbook."

Cluster reconfiguration steps

The cluster reconfiguration process triggers IMR, and a seven-step process ensures complete reconfiguration.

  1. Name service is frozen. The CGS contains an internal database of all the members/instances in the cluster with all their configuration and servicing details. The name service provides a mechanism to address this configuration data in a structured and synchronized manner.
  2. Lock database (IDLM) is frozen. The lock database is frozen to prevent processes from obtaining locks on resources that were mastered by the departing/dead instance.
  3. Membership is determined and validated, and IMR takes place.
  4. Bitmap rebuild takes place, along with instance name and uniqueness verification. CGS must synchronize the cluster to be sure that all members get the reconfiguration event and that they all see the same bitmap.
  5. Delete all dead-instance entries and republish all newly configured names.
  6. Unfreeze and release name service for use.
  7. Hand over reconfiguration to GES/GCS.
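
The ordering of the seven steps can be sketched as a small simulation. This is only an illustrative toy model, not Oracle's implementation: the function `reconfigure`, its arguments, and the step labels are all hypothetical names chosen for this example, and the voting outcome is passed in as `dead` rather than computed.

```python
def reconfigure(members, name_entries, dead):
    """Walk a toy cluster through the seven steps listed above.

    members      -- iterable of member numbers currently in the cluster
    name_entries -- dict mapping a published name to its owning member
    dead         -- set of members that the membership vote decided to drop
    All three arguments are hypothetical; real CGS state is far richer.
    """
    steps = []

    # 1. Freeze the name service so the configuration data cannot change.
    steps.append("name service frozen")

    # 2. Freeze the lock database so no new locks are granted on
    #    resources mastered by the departing/dead instance.
    steps.append("lock database frozen")

    # 3. Determine and validate membership (the vote result is an input here).
    survivors = set(members) - set(dead)
    steps.append("membership determined")

    # 4. Rebuild the bitmap that every surviving member must agree on.
    bitmap = sorted(survivors)
    steps.append("bitmap rebuilt")

    # 5. Delete dead-instance entries and republish the remaining names.
    names = {name: owner for name, owner in name_entries.items()
             if owner in survivors}
    steps.append("dead entries deleted, names republished")

    # 6. Unfreeze the name service.
    steps.append("name service unfrozen")

    # 7. Hand over to GES/GCS for resource recovery.
    steps.append("handed over to GES/GCS")

    return bitmap, names, steps


bitmap, names, steps = reconfigure(
    members=[0, 1, 2, 3],
    name_entries={"svc_a": 0, "svc_c": 2},
    dead={2},
)
print(bitmap)  # [0, 1, 3]
```

The point of the sketch is the ordering: nothing is deleted or republished until the name service is frozen and the surviving membership has been agreed.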

Now that you know when IMR starts and node evictions take place, let's look at the corresponding messages in the alert log and LMON trace files to get a better picture. (The logs have been edited for brevity. Note all the lines in boldface define the most important steps in IMR and the handoff to other recovery steps in CGS.)

Problem with a Node

Assume a four-node cluster (instances A, B, C, and D), in which instance C has a problem communicating with other nodes because its private link is down. All other services on this node are assumed to be working normally.

Alert log on instance C:

ORA-29740: evicted by member 2, group incarnation 6
Thu Jun 30 09:15:59 2005
LMON: terminating instance due to error 29740
Instance terminated by LMON, pid = 692304
… … …

Alert log on instance A:

Thu Jun 30 09:15:59 2005
Communications reconfiguration: instance 2
Evicting instance 3 from cluster
Thu Jun 30 09:16:29 2005
Trace dumping is performing id=[50630091559]
Thu Jun 30 09:16:31 2005
Waiting for instances to leave: 3
Thu Jun 30 09:16:51 2005
Waiting for instances to leave: 3
Thu Jun 30 09:17:04 2005
Reconfiguration started
List of nodes: 0,1,3,
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Thu Jun 30 09:17:04 2005
Reconfiguration started
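
To see how long the eviction took from instance A's point of view, the alert log timestamps can be subtracted. The following is a minimal sketch that assumes the timestamp layout shown in the excerpt and an English locale for the day and month names; `seconds_between` is a hypothetical helper, not an Oracle utility.

```python
from datetime import datetime

# Layout of the alert log timestamps in the excerpt, e.g. "Thu Jun 30 09:15:59 2005".
# %a and %b are locale dependent; an English locale is assumed here.
ALERT_TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def seconds_between(start, end):
    """Return the number of seconds between two alert log timestamps."""
    t0 = datetime.strptime(start, ALERT_TS_FORMAT)
    t1 = datetime.strptime(end, ALERT_TS_FORMAT)
    return (t1 - t0).total_seconds()


eviction_started = "Thu Jun 30 09:15:59 2005"   # "Evicting instance 3 from cluster"
reconfig_started = "Thu Jun 30 09:17:04 2005"   # "Reconfiguration started"
print(seconds_between(eviction_started, reconfig_started))  # 65.0
```

In this excerpt, roughly a minute passes between the eviction decision and the start of the reconfiguration that lists nodes 0, 1, and 3.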

LMON trace file on instance A:

*** 2005-06-30 09:15:58.262
kjxgrgetresults: Detect reconfig from 1, seq 12, reason 3
kjxgfipccb: msg 0x1113dcfa8, mbo 0x1113dcfa0, type 22, ack 0, ref 0, stat 3
kjxgfipccb: Send timed out, stat 3 inst 2, type 22, tkt (10496,1496)
*** 2005-06-30 09:15:59.070
kjxgrcomerr: Communications reconfig: instance 2 (12,4)
Submitting asynchronized dump request [2]
kjxgfipccb: msg 0x1113d9498, mbo 0x1113d9490, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 2, type 22, tkt (10168,1496)
kjxgfipccb: msg 0x1113e54a8, mbo 0x1113e54a0, type 22, ack 0, ref 0, stat 6
kjxgfipccb: Send cancelled, stat 6 inst 2, type 22, tkt (9840,1496)

Note that "Send timed out, stat 3 inst 2" is LMON trying to send message(s) to the broken instance.

kjxgrrcfgchk: Initiating reconfig, reason 3 /* IMR Initiated */
*** 2005-06-30 09:16:03.305
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 12 0.
*** 2005-06-30 09:16:03.449
Name Service frozen
kjxgmcs: Setting state to 12 1.
*** 2005-06-30 09:16:11.570
Voting results, upd 1, seq 13, bitmap: 0 1 3

Note that instance A has not tallied the vote; hence it has received only the voting results. Here is an extract from the LMON trace file on instance B, which managed to tally the vote:

Obtained RR update lock for sequence 13, RR seq 13
*** 2005-06-30 09:16:11.570
Voting results, upd 0, seq 13, bitmap: 0 1 3
… …
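
The "Voting results" line carries the surviving membership as a bitmap, so the evicted member can be derived by comparing it with the membership before the vote. Below is a minimal sketch that assumes the line format shown in the excerpt and a previous membership of members 0 through 3 (one per instance in the four-node example); `parse_voting_results` and `evicted_members` are hypothetical helpers.

```python
import re

# Matches e.g. "Voting results, upd 1, seq 13, bitmap: 0 1 3"
VOTE_RE = re.compile(r"Voting results, upd (\d+), seq (\d+), bitmap:((?: \d+)+)")


def parse_voting_results(line):
    """Extract upd, seq, and the member bitmap from an LMON trace line."""
    match = VOTE_RE.search(line)
    if match is None:
        return None
    upd, seq, bitmap = match.groups()
    return {
        "upd": int(upd),
        "seq": int(seq),
        "bitmap": [int(member) for member in bitmap.split()],
    }


def evicted_members(previous_members, bitmap):
    """Members present before the vote but missing from the new bitmap."""
    return sorted(set(previous_members) - set(bitmap))


result = parse_voting_results("Voting results, upd 1, seq 13, bitmap: 0 1 3")
print(evicted_members(range(4), result["bitmap"]))  # [2]
```

Member 2 in the zero-based bitmap lines up with the "Evicting mem 2" entry in the trace and with "Evicting instance 3" in the alert log, where instances appear to be numbered from 1.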

Here's the LMON trace file on instance A:

Evicting mem 2, stat 0x0007 err 0x0002
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 13 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 13 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 13 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 13 5.
Name Service normal
Name Service recovery done
*** 2005-06-30 09:17:04.369
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 13 0.
*** 2005-06-30 09:17:04.371
Name Service frozen
kjxgmcs: Setting state to 13 1.

GES/GCS recovery starts here:

Global Resource Directory frozen
node 0
node 1
node 3
res_master_weight for node 0 is 632960
res_master_weight for node 1 is 632960
res_master_weight for node 3 is 632960
… … …

Death of a Member

For the same four-node cluster (A, B, C, and D), instance C has died unexpectedly:

kjxgrnbrisalive: (3, 4) not beating, HB: 561027672, 561027672
*** 2005-06-19 00:30:52.018
kjxgrnbrdead: Detected death of 3, initiating reconfig
kjxgrrcfgchk: Initiating reconfig, reason 2
*** 2005-06-19 00:30:57.035
kjxgmrcfg: Reconfiguration started, reason 2
kjxgmcs: Setting state to 6 0.
*** 2005-06-19 00:30:57.037
Name Service frozen
kjxgmcs: Setting state to 6 1.
*** 2005-06-19 00:30:57.239
Obtained RR update lock for sequence 6, RR seq 6
*** 2005-06-19 00:33:27.261
Voting results, upd 0, seq 7, bitmap: 0 2
Evicting mem 3, stat 0x0007 err 0x0001
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 7 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 7 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 7 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 7 5.
Name Service normal
Name Service recovery done
*** 2005-06-19 00:33:27.266
kjxgmps: proposing substate 6
… … …
kjxgmps: proposing substate 2
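
When reading a long LMON trace, it can help to pull out only the "Setting state to" lines and follow the progression. The sketch below does that with a regular expression. Reading the two numbers as a reconfiguration sequence and a substate is an interpretation of the excerpt, not documented behavior, and `state_transitions` is a hypothetical helper.

```python
import re

# Matches e.g. "kjxgmcs: Setting state to 7 2."
STATE_RE = re.compile(r"kjxgmcs: Setting state to (\d+) (\d+)\.")


def state_transitions(trace_text):
    """Return (sequence, substate) pairs in the order they appear in the trace.

    The meaning assigned to the two numbers is an assumption for illustration.
    """
    return [(int(seq), int(sub)) for seq, sub in STATE_RE.findall(trace_text)]


trace = """\
kjxgmcs: Setting state to 6 0.
Name Service frozen
kjxgmcs: Setting state to 6 1.
Voting results, upd 0, seq 7, bitmap: 0 2
kjxgmcs: Setting state to 7 2.
kjxgmcs: Setting state to 7 3.
kjxgmcs: Setting state to 7 4.
kjxgmcs: Setting state to 7 5.
"""
print(state_transitions(trace))
# [(6, 0), (6, 1), (7, 2), (7, 3), (7, 4), (7, 5)]
```

In this excerpt the first number changes from 6 to 7 once the voting results arrive, while the second number climbs from 0 to 5 as the name service is frozen, recovered, and returned to normal.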

GES/GCS recovery starts here:

Global Resource Directory frozen
node 0
node 2
res_master_weight for node 0 is 632960
res_master_weight for node 2 is 632960
Total master weight = 1265920
Dead inst 3
Join inst
Exist inst 0 2
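
As a quick sanity check on the excerpt above, the per-node master weights should add up to the reported total (632960 + 632960 = 1265920). Here is a small sketch that performs that check on trace text, assuming the line formats shown; `master_weights` and `total_matches` are hypothetical helpers.

```python
import re

WEIGHT_RE = re.compile(r"res_master_weight for node (\d+) is (\d+)")
TOTAL_RE = re.compile(r"Total master weight = (\d+)")


def master_weights(trace_text):
    """Map each node number to its res_master_weight from the trace."""
    return {int(node): int(weight)
            for node, weight in WEIGHT_RE.findall(trace_text)}


def total_matches(trace_text):
    """True if the per-node weights sum to the reported total.

    Returns None when the trace contains no "Total master weight" line.
    """
    total = TOTAL_RE.search(trace_text)
    if total is None:
        return None
    return sum(master_weights(trace_text).values()) == int(total.group(1))


trace = """\
res_master_weight for node 0 is 632960
res_master_weight for node 2 is 632960
Total master weight = 1265920
"""
print(master_weights(trace))  # {0: 632960, 2: 632960}
print(total_matches(trace))   # True
```

The two surviving nodes carry equal weight in this trace, as do the three survivors in the earlier scenario.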

About the book:   
Oracle Database 10g: Real Application Clusters Handbook. Learn to implement Oracle Real Application Clusters from the ground up. Maximize database availability, scalability, and efficiency. Find RAC concepts, administration, tuning, and troubleshooting information. You'll learn how to prepare and create Oracle RAC databases and servers, and automate administrative tasks. You'll also get full coverage of cutting-edge Oracle RAC diagnostic tools, backup and recovery procedures, performance tweaks, and custom application design strategies. Buy this book at McGraw-Hill/Osborne
About the author:   
K Gopalakrishnan is a senior principal consultant with the Advanced Technology Services group at Oracle Corporation, specializing exclusively in performance tuning, high availability, and disaster recovery. He is a recognized expert in Oracle RAC and Database Internals and has used his extensive expertise in solving many vexing performance issues all across the world for telecom giants, banks, financial institutions, and universities.

This was first published in May 2007
