DB2的crash recovery原理

 

If a failure occurs while the instance is running and DB2(R) cannot complete or commit all of the outstanding units of work, the database is left in an inconsistent and unusable state. Crash recovery is the process by which the database is moved back to a consistent and usable state.

When such a situation occurs, you will see entries in the db2diag.log similar to the following when a user tries to connect to or activate the database:

2005-02-04-11.03.01.920391-480 I3693708C368 LEVEL: Warning
PID : 53980 TID : 1 PROC : db2agent (SAMPLE) 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050114231611
FUNCTION: DB2 UDB, base sys utilities, sqledint, probe:30
MESSAGE : Crash Recovery is needed. 

2005-02-04-11.03.05.223292-480 I3694077C431 LEVEL: Warning
PID : 53980 TID : 1 PROC : db2agent (SAMPLE) 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050114231611
FUNCTION: DB2 UDB, recovery manager, sqlpresr, probe:410
MESSAGE : Crash recovery started. LowtranLSN 00000000E0D7D193 MinbuffLSN
00000000E0D7CC4D

LSN stands for Log Sequence Number. Database manager identifies and tracks transactions using LSN numbers. Note that an application-level "transaction" may be split into many LSNs internally, for example, delete of all rows from a table in a single unit of work. The entire deletion is one transaction but internally there will be one LSN logged for every deleted row.

The LowtranLSN and MinbuffLSN values supplied here are helpful if a problem occurs during the recovery, since they give insight into the log sequence numbers that crash recovery will be processing.

The LowtranLSN value is the log sequence number of the first log record written by the oldest uncommitted transaction. The age of the transaction here is based on the time transactions write their first log records, not the unit of work (UOW) start time reported by database monitor.

DB2 uses write-ahead logging; this means that every transaction is first written to log files and then to the database. Minbuff LSN indicates the oldest LSN that has been committed but not flushed to the database.

Crash recovery will start using the lowest of MINBUFLSN and LOWTRANLSN (in some cases they are the same).

If you are using log retention logging (as opposed to circular logging), you can use the db2flsn tool to match a log sequence number to a log file.

Recovery is comprised of several "phases". If crash recovery fails, it is helpful to know which phase it was in at the time. The phases are:

  1. Redo or forward phase: During this phase all available transactions, committed or uncommitted are replayed.
  2. Undo Phase - During this phase, uncommitted transactions are rolled back. This is the second phase. Here, DB2 will rollback any transactions that were uncommitted at the time of the crash. This can potentially be a lengthy amount of time if, for example, an application had done a large amount of work without committing before it crashed. At the end of redo phase, DB2 compiles a list of all the transactions for which there were no commit records. This becomes the list of unresolved or uncommitted transactions. The logger then writes compensation or undo log records in order to roll back these transactions.
  3. Sideward phase (Multi-partition only) - This is the last phase. If any synchronization needs to be done between the current partition and the catalog partition, it will be done during this phase.

Note that if you are concerned that recovery seems to be taking a long time, you can monitor its status using the LIST UTILITIES command.

If the recovery completes successfully, you will see an entry like this:

2005-02-04-11.03.05.605091-480 I3696386C497 LEVEL: Warning
PID : 53980 TID : 1 PROC : db2agent (SAMPLE) 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050114231611
FUNCTION: DB2 UDB, recovery manager, sqlpresr, probe:3170
MESSAGE : Crash recovery completed. Next LSN is 000000750CF31816

Whereas if the recovery fails, a return code would be supplied as follows:

2005-02-04-11.03.05.605091-480 I3696386C497 LEVEL: Error
PID : 53980 TID : 1 PROC : db2agent (SAMPLE) 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050114231611
FUNCTION: DB2 UDB, recovery manager, sqlpresr, probe:640
DATA #1 : String, 140 bytes
Crash recovery failed with rc -2028994521
LowtranLSN 00000000E0D7CC4D MinbuffLSN 00000000E0D7CC4D
Recovery started on log file S0000153.LOG.

These crash recovery return codes are not documented, but quite often you can uncover clues to the cause of the problem by looking at earlier db2diag.log entries. For example, there may be indications of log corruption, bad log records, or a bad page on disk. An example of this type of message in the db2diag.log is as follows:

2005-02-24-12.05.42.265000+480 I2548H467 LEVEL: Error
PID : 2048 TID : 2016 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050224040537
FUNCTION: DB2 UDB, buffer pool services, sqlbcres, probe:1100
RETCODE : ZRC=0x86020019=-2046689255=SQLB_CSUM "Bad Page, Checksum Error"
DIA8426C A invalid page checksum was found for page "".

2005-02-24-12.05.42.281000+480 I3017H374 LEVEL: Error
PID : 2048 TID : 2016 PROC : db2syscs.exe
INSTANCE: DB2 NODE : 000 DB : SAMPLE
APPHDL : 0-8 APPID: *LOCAL.DB2.050224040537
FUNCTION: DB2 UDB, buffer pool services, sqlbcres, probe:1100
MESSAGE : Obj={pool:2;obj:12;type:0} State=x27

The second entry in this example indicates that object 12 in tablespace 2 is corrupted. To confirm whether or not the table is corrupted, you could use the db2dart tool.

The benefit of analyzing these messages is that they may provide DBAs with clues as to what actions need to be taken next in order to get the database back up and running. For example, you may be able to:

  • Mark the corrupted objects as being "bad" using the db2dart tool and then in skip these objects during a subsequent attempt at crash recovery
  • Restore the database from an earlier backup

Note that while many of these actions will allow you to get up and running quickly, they may also reduce the possibility that DB2 Support will be able to do complete cause analysis thereafter. Thus this should be taken into consideration when you are deciding whether to take these actions prior to calling DB2 Suppport.

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章