ORA-01555错误浅析(3)

日期： 2005-09-07 14:50:11

分享到

1555错误发生的情况

下面我们就模拟一下1555错误发生的情况。

测试环境

首先建立测试环境。由于我们只是要模拟1555错误的发生，所以需要建立一个小的回滚表空间，并且设置undo_retention时间为1（秒），以便回滚数据尽快被覆盖（呵呵，要防止1555错误发生，这就一定要避免的）。
CREATE UNDO TABLESPACE rbs_ts


                                                                                                DATAFILE 'rbs_ts2.dbf' SIZE 10M AUTOEXTEND OFF;
 
alter system set undo_retention=1 scope=spfile;
 
alter system set undo_management=auto scope=spfile;
 
alter system set undo_tablespace=rbs_ts scope=spfile;
 
startup force
 
alter tablespace rbs_ts online;
 
create table demo.t_dual as select * from dual;
 
insert into t_dual values(1);
 
commit;
 
一致性读导致的1555错误

开始读取表。
SQL>
SQL> var cl refcursor
SQL> begin
  2  open :cl for select * from demo.t_multiver;
  3  end;
  4  /
 
PL/SQL procedure successfully completed.
 
SQL>
更新表数据，产生回滚信息。
SQL> update demo.t_multiver set b = 111 where a = 1;
 
1 row updated.
 
SQL> commit;
 
Commit complete.
 
运行大批其他事务，充满所有回滚段，以致覆盖上面的回滚信息。回滚段可以通过dba_rollback_segs查看。
SQL> begin
  2    for i in 1..20000 loop
  3      update demo.t_dual set dummy=1;
  4      commit;
  5    end loop;
  6  end;
  7
  8  /
 
PL/SQL procedure successfully completed.
 
SQL> /
 
PL/SQL procedure successfully completed.
 
查询到更新过的数据记录，回滚信息已经被覆盖，所以报1555错误。
SQL> print :cl
ERROR:
ORA-01555: snapshot too old: rollback segment number 18 with name "_SYSSMU18$"
too small
 
 
 
no rows selected
 
SQL>
延迟块清除导致的1555错误

开始读取表。
SQL> var cc refcursor
SQL>
SQL> begin
  2  open :cc for select * from t_multiver;
  3  end;
  4  /
 
这时一个事务更新了该数据块，但在提交前，我们手工将buffer cache中的数据做了flush，再做提交。这时的数据块上只记录了锁标志，没有事务标志和Commit SCN。
PL/SQL procedure successfully completed.
 
SQL>
SQL> update t_multiver set b=115 where a=1;
 
1 row updated.
 
SQL>
SQL> alter system flush buffer_cache;
 
System altered.
 
SQL>
SQL> commit;
 
Commit complete.
 
进行非常多的事务，将回滚段中的事务信息表中的数据全部覆盖：
SQL>
SQL> begin
  2    -- overwrite rollback slot
  3    for i in 1..40000 loop
  4      update t_dual set dummy=1;
  5      commit;
  6    end loop;
  7  end;
  8  /
 
PL/SQL procedure successfully completed.
 
读取数据块前需要到回滚段的事务信息表中读取Itl中没有标记完全的事务的状态和Commit SCN，以判断是否需要进行一致性读。但是事务信息表中的数据都已经被覆盖，所以报1555错误：
SQL>
SQL> print :cc
ERROR:
ORA-01555: snapshot too old: rollback segment number 20 with name "_SYSSMU20$"
too small
 
 
 
no rows selected
以上两个例子看起来是好像很类似，但是，他们的本质区别是：第一个实际上是在进行一致性读得时候发生的1555错误，而第二个例子是在判断是否需要进行一致性读得时候发生的1555错误。

解决1555错误的方法

现在，我们已经知道了1555错误产生的原因。那么，就可以总结出以下方法来解决1555错误问题：

1、扩大回滚段

因为回滚段是循环使用的，如果回滚段足够大，那么那些被提交的数据信息就能保存足够长的时间是那些大事务完成一致性读取。

2、增加undo_retention时间

在undo_retention规定的时间内，任何其他事务都不能覆盖这些数据。

3、优化相关查询语句，减少一致性读。

减少查询语句的一致性读，就降低读取不到回滚段数据的风险。这一点非常重要！

4、减少不必要的事务提交

提交的事务越少，产生的回滚段信息就越少。

5、对大事务指定回滚段

通过以下语句可以指定事务的回滚段：
SET TRANSACTION USE ROLLBACK SEGMENT rollback_segment；
给大事务指定回滚段，即降低大事务回滚信息覆盖其他事务的回滚信息的几率，又降低了他自身的回滚信息被覆盖的几率。大事务的存在，往往是1555错误产生的诱因。

6、使用游标时尽量使用显式游标，并且只在需要的时候打开游标，同时将所有可以在游标外做的操作从游标循环中拿出。

当游标打开时，查询就开始了，直到游标关闭。减少游标的打开时间，就减少了1555错误发生的几率。

下面例子中，第一段代码发生1555错误的几率就大于第二段的：

差的：
declare 


cursor cl is select b from demo.t_multiver;
v_b number;


begin


open cl; 
--do some thing without relation to the cursor.
 
fetch cl into v_b;


while cl%found loop
   --do other things without relation to the cursor.
  ... ...


  fetch cl into v_b;


end loop;
close cl; 
commit; 


END; 
好的：
declare 


cursor cl is select * from demo.t_multiver;


begin
--do some thing without relation to the cursor.
--do other things without relation to the cursor.
 
open cl; 


fetch cl into v_b;


while cl%found loop
  ... ...


  fetch cl into v_b;


end loop;
close cl; 
commit; 


END; 
7、使用回滚表空间自动管理

回滚表空间自动管理是9i后的特性。他由Oracle自动管理回滚段的创建和回收。尽管有人认为这一特性是以后牺牲性能为代价的，或者有其他缺点而不建议使用。但我认为，这确实是Oracle一个很好的特性，特别是OLTP环境下应该使用它。并且10g中，这一特性大大增强了。

而在大型的数据仓库或者报表系统中，会有一些很大的查询作业存在，这时可以考虑使用手动管理，为某些大作业创建单独的回滚段。

以上总结了解决1555错误的各种办法，具体采用哪种方式，就需要根据错误产生的实际情况来决定了。

实例分析

实际上，你在了解了1555错误为什么会发生的前提，遇到了1555错误就不应该再手足无措了。但是，根据我个人的经验，大多数的1555错误的发生，其根本原因还是语句写得太烂，导致了大量的consistent gets和超长的执行时间，最后引发了1555错误。下面就是一个典型例子：

错误的发生

近来生产系统反馈，时常有作业被异常中止，导致应用程序被hung住。经过检查日志，是某个作业在运行时发生了1555错误，导致程序无法返回结果：
* 


ERROR at line 1: 


ORA-01555: snapshot too old: rollback segment number 9 with name "RBS08" too small 


ORA-06512: at "SQLUSER.EXT_PKG", line 4917 


ORA-06512: at line 1 
相关程序记录下的日志：
STAT-S.20060313185536.lg = 2 processed =20060313185536 end date 
Error = ORA-01555: snapshot too old: rollback segment number 9 with name "RBS08" too small 
BEGIN ext_pkg.main('extfiles','50','EAT'); END;
错误分析解决

这是一个典型的1555错误。检查引发该错误的PACKAGE，发现它只有一个入口函数main（及程序日志中记录的函数），但这个函数还调用了其他N个PACKAGE里面的函数。这是一个大作业，执行时，设置它使用了一个大的回滚段：RBS_BATCH1。

先看看相关配置：rollback tablespace空间为8G，undo_retention为1800。

看看回滚段的统计数据：
SQL> Select rn.Name "Rollback Segment", rs.RSSize/1024 "Size (KB)", rs.Gets "Gets",
  2         rs.waits "Waits", (rs.Waits/rs.Gets)*100 "% Waits",
  3         rs.Shrinks "# Shrinks", rs.Extends "# Extends"
  4  from   sys.v_$RollName rn, sys.v_$RollStat rs
  5  where  rn.usn = rs.usn;
 
Rollback Segment                Size (KB)       Gets      Waits    % Waits  # Shrinks  # Extends
------------------------------ ---------- ---------- ---------- ---------- ---------- ----------
SYSTEM                                952       1189          0          0          0          0
RBS_BATCH1                         409592     681422        667 .097883544          0          0
... ...S_BATCH1 getshould be provided by DEV Team.r.and CSS_EMAN_INDX exist.eir 
RBS10                              204792     478502         10 .002089855          0          0
RBS11                              204792     477366          8 .001675863          0          0
RBS12                              204792     491070          6 .001221822          0          0
RBS_BATCH2                         409592     650088        644 .099063511          0          0
RBS_BATCH3                         409592     243849          3  .00123027          0          0
 
16 rows selected.
注意到RBS_BATCHT1的wait%是0.098%，这个值应该是比较好的一个值。

回过头再来看依法错误的语句：调用ext_pkg.main函数。在程序日志中已经记录下了输入参数，这就比较好办了：作一个trace，看看到底哪条语句的性能最差：
SQL> alter system set events ‘10046 trace name context forever, level 1’;
 
System altered.
 
SQL> exec ext_pkg.main('extfiles','50','EAT');
 
PL/SQL procedure successfully completed.
 
SQL> alter system set events ‘10046 trace name context off’;
 
System altered.
 
用tkprof处理trace文件后检查trace文件：
TKPROF: Release 9.2.0.5.0 - Production on Tue Mar 14 09:21:58 2006
 
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
 
Trace file: prod_ora_29225.trc
Sort options: default
 
********************************************************************************
count    = number of times OCI procedure was executed
cpu      = cpu time in seconds executing
elapsed  = elapsed time in seconds executing
disk     = number of physical reads of buffers from disk
query    = number of buffers gotten for consistent read
current  = number of buffers gotten in current mode (usually for update)
rows     = number of rows processed by the fetch or execute call
********************************************************************************
 
alter session set events='10046 trace name context forever, level 1'
 
... ...
 
BEGIN
ext_pkg.main('extfiles','50','EAT');
END;
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.04          4         39          0           0
Execute      1      0.20       0.21      24560      37808          0           1
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.21       0.26      24564      37847          0           1
 
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 30
********************************************************************************
 
... ...
 
 
SELECT QID
FROM
 CSQUE QUE, ASTP STP WHERE
  QUE.CDE IN (:B3, :B2) AND QUE.TID =
  STP.TID AND STP.ACDE = :B1 ORDER BY
  QUE.CDE, QUE.DT
 
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      2      0.00       0.00          0          0          0           0
Fetch     2682      0.41       0.54       7557      10968          0        2680
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     2685      0.41       0.54       7557      10968          0        2680
 
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 31     (recursive depth: 1)
********************************************************************************
 
... ...
 
 
    7  user  SQL statements in session.
   41  internal SQL statements in session.
   48  SQL statements in session.
********************************************************************************
Trace file: prod_ora_29225.trc
Trace file compatibility: 9.02.00
Sort options: default
 
       1  session in tracefile.
       7  user  SQL statements in trace file.
      41  internal SQL statements in trace file.
      48  SQL statements in trace file.
      20  unique SQL statements in trace file.
    8626  lines in trace file.
 
从trace文件中，发现有一条语句性能相当差，通过对这条语句做SQL Trace，发现它的consistent gets达到80万！

于是对该语句进行优化，调整了它的写法，并建立了缺少的索引（优化过程略）。最终将consistent gets数量降低到了5000。

重新安排上线，经过一周的观察，1555错误没再发生。

其实这个案例的解决是比较简单的，最终的处理就是将一条语句进行优化。

^_^

上一篇：数据段压缩（Data Segment Compression）浅析

下一篇：ORA-01555错误浅析(2)

本类中的所有文章