2011-04-28

ORA-752 or ORA-600 [3020]

Resolving ORA-752 or ORA-600 [3020] During Standby Recovery [ID 1265884.1]

Applies to:

Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.2.0.2 - Release: 10.2 to 11.2
Information in this document applies to any platform.

Symptoms

Standby Redo Apply can terminate due to a failure of redo-data consistency checks, a problem called stuck recovery. Stuck recovery can occur when an underlying operating system or storage system loses a write issued by the Primary or Standby database during normal operation. Because there is an inconsistency between the information stored in the redo and the information stored in a database block being recovered, the database signals an internal error when applying the redo.

ORA-00600: internal error code, arguments: [3020], [2885689059], [1], [419819],[26750], [808], [], []
ORA-10567: Redo is inconsistent with data block (file# 1, block# 419819)
ORA-10564: tablespace USER1
ORA-01110: data file ’/oracle/datafiles/user1.dbf’

Cause

The ORA-600 [3020] stuck recovery error could occur on the Standby database for several reasons including: a lost write on the Primary, a lost write on the Standby, missing redo, or logical corruption on the primary resulting in an incomplete redo chain.

Note: With DB_LOST_WRITE_PROTECT enabled on the Primary and Standby, the Standby Redo Apply terminates with the ORA-752 error when a Primary lost write is detected.

ORA-752: recovery detected a lost write of a data block

This ORA-752 error indicates a lost write occurred on the Primary database. Oracle strongly recommends enabling DB_LOST_WRITE_PROTECT (and DB_BLOCK_CHECKSUM=FULL) for greater detection and protection from lost writes. Studies have shown the impact on the primary database is negligible.

Solution

In the majority of cases, Standby stuck recovery errors indicate a corruption of the Primary database. No errors may have been reported on the Primary.
WARNING: Do not repair the Standby by restoring a backup taken on the Primary, as that will ensure that the Standby is also corrupt! The only exception is when the Standby is known to have a lost write, but this determination should be made by Oracle Support.
An ORA-752 error definitively identifies a lost write on the Primary. Consider failing over to the Standby immediately if data integrity is critical and some data loss is acceptable. Oracle Support should also be engaged immediately when an ORA-600 [3020] error occurs by opening a Service Request via My Oracle Support.
When media recovery encounters a problem, the alert log may indicate that recovery can continue if it is allowed to corrupt the data block causing the problem. The alert log contains information about the block: its block type, block address, the tablespace it belongs to, and so forth. For blocks containing user data, the alert log may also report the data object number.

For a block containing user data, you can query the database to find out which object or table owns this block. If the block belongs to an object that can be recreated or is unimportant then it might be advisable to allow recovery to proceed after marking the block corrupt. This procedure is covered in a subsequent section.

Determining the Extent of Corruption

To determine if the corruption is isolated run a diagnostic trial recovery, this scans the redo for problems but does not actually make any changes to the recovered database. Trial recovery reports any additional corruptions in the alert_<SID>.log. You can use the RECOVER ... TEST statement to invoke trial recovery. Refer to Document 283262.1 for additional details on trial recovery.

Determining Root Cause

Information that needs to be gathered and sent to Oracle Support immediately:
  1. Complete or an extract from the alert.log covering at least the period from the last successful database startup.
  2. RDA report (or at least init.ora or the spfile). See Document 314422.1
  3. All tracefiles generated at and after the time of failure.
  4. Any system and I/O subsytem log/error files covering the period from the last successful startup.
  5. Dump of the controlfiles
    SQL> alter session set events 'immediate trace name controlf level xx';
  6. Dump of the datafile headers:
    SQL> alter session set events 'immediate trace name file_hdrs level 10';
  7. Dump of the redo log headers:
    SQL> alter session set events 'immediate trace name redohdr level 10';

What actions can be taken when an ORA-752 lost write error is signalled?

Option 1: Determine if affected objects can be recreated and recovery allowed to continue:
  1. First determine the affected objects. The alert log message will provide the datafile number along with the corresponding block number. For blocks containing user data, the alert log may also report the data object number. Using this information you can determine which objects are affected by the corruption:
    SQL> select SEGMENT_NAME from DBA_EXTENTS
         where FILE_ID=&file_number and
         &block_number BETWEEN BLOCK_ID and BLOCK_ID+BLOCKS-1;

    If the error provides the object number determine the affected object with the following query:
    SQL> select OWNER, OBJECT_NAME, SUBOBJECT_NAME, OBJECT_TYPE
         from DBA_OBJECTS
         where DATA_OBJECT_ID = &object_number;

  2. If feasible, drop and recreate the affected objects on the primary.
  3. Once the objects have been recreated, use the following procedure to skip corrupted block on the standby:
    1. Temporarily disable lost write protection on the standby:
      SQL> ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = NONE; 

    2. Allow recovery to proceed in spite of block corruptions by running the RECOVER command with the ALLOW n CORRUPTION clause, where n is the number of allowable corrupt blocks.
      SQL> alter database recover automatic standby database
           allow 1 corruption;

    3. Once the alert log indicates the blocks have been marked corrupt, restart managed recovery.
      SQL> alter database recover cancel;
      SQL> alter database recover managed standby database
           using current logfile disconnect;
Option 2: Activate the standby database
If the affected objects cannot be recreated then activate the standby database. By activating the standby database you will be experiencing data loss but data integrity will be ensured.
  1. Issue the following SQL statement on the standby database to convert it to a primary:
    SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE;

  2. Back up the new primary. Performing a backup immediately is a necessary safety measure, because you cannot recover changes made after the failover without a complete backup copy of the database. As a result of the failover, the original primary database can no longer participate in the Data Guard configuration, and all other standby databases will now receive and apply redo data from the new primary database.
  3. Open the new primary database.
  4. An optional step is to recreate the failed primary as a physical standby. This can be done using the database backup taken at the new primary in step 3. (You cannot use flashback database or the Data Guard broker to reinstantiate the old primary database in this situation.)

    Be aware that a physical standby created using the backup taken from the new primary will have the same datafiles as the old standby. Therefore, any undetected lost writes that the old standby had before it was activated will not be detected by the new standby, since the new standby will be comparing the same blocks. Any new lost writes that happen on either the primary or the standby will be detected.

What action can be taken when an ORA-600 [3020] is signalled?

Engage Oracle Support immediately when an ORA-600 [3020] error occurs. Be prepared to supply the information listed under the “Determining the Root Cause” section when opening a Service Request via My Oracle Support.

Protecting Against Lost Writes

Protect against lost write by setting DB_LOST_WRITE_PROTECT to TYPICAL on primary and standby databases. By doing so the physical standby Redo Apply process will compare the block SCN on the standby to the block SCN stored in the primary redo stream (when the block was read) to decide whether there is a lost write on the primary. Redo Apply is able to do this using the additional information logged at the primary when DB_LOST_WRITE_PROTECT is enabled. If the block SCN on the primary database is lower than on the standby database, then it detects a lost write on the primary database and throws an external error (ORA-752).

Oracle strongly recommends enabling DB_LOST_WRITE_PROTECT and DB_BLOCK_CHECKSUM=FULL for greater detection and protection from lost writes. Studies have shown the impact on the primary database is negligible.

Niciun comentariu:

Trimiteți un comentariu