ORA-04031這個錯誤,幾乎每一個專業的DBA都遇到過。這是一個相當嚴重的錯誤,Oracle進程在向SGA申請內存時,如果申請失敗,則會報這個錯誤。大部分情況下是在向SGA中的shared pool申請內存時失敗,而少有向large pool等池中申請內存失敗。比如下面的報錯:
- Wed Apr 27 16:00:25 2011
- Errors in file /oracle/app/oracle/admin/zxin/bdump/zxin1_ora_2052294.trc:
- ORA-04031: unable to allocate 4128 bytes of shared memory
- ("shared pool","unknown object","sga heap(3,0)","kgllk hash table")
這裏很清楚地表示出來,是在向shared pool申請內存時失敗。
shared pool內存申請(分配)失敗,通常有如下的幾種可能:
- shared pool過小,比如在SGA Manual Management方式下,shared pool設置過小。比如一套數千連接的大系統,shared pool只設置了幾百M。這種情況下,要解決問題很解單,增加shared pool的大小即可。
- 應用沒有使用綁定變量,硬解析非常多,導致shared pool內存碎片嚴重,分配大塊內存時不能獲得連續的內存空間。硬解析多的一個變種是雖然使用了綁定變量,但是由於某種原因,Cursor不能共享,導致Child Cursor非常多。實際上,如果shared pool較大(比如數GB大小),這種問題還是很少出現的,並且出現也通常出現在申請大塊內存時。這種情況如果使用alter system flush shared_pool可以暫時緩解問題。但是這條命令又通常不適用於shared pool較大而且比較繁忙的系統。使用綁定變量
- Cache的cursor很多,同時cursor_space_for_time這一參數設置爲TRUE,可能會使shared pool碎片化嚴重,導致不能分配到大塊的連續內存。
- Oracle的BUG導致內存泄露,比如在一些版本中查詢v$segment_statistics這樣的視圖導致內存泄露,使shared pool內存耗光。同樣的情形還有類似於“obj stat memory”,"gcs resources","ges resources"等。通常這類內存爲perm類型(permanet),這類內存通常是在分配時就確定了固定的用途,不能用於其他用途,因此極容易產生碎片。
- Oracle從9i開始,根據shared pool的大小將shared pool分爲多個子池(subpool),每個子池有獨立的free list,同時在分配時單獨管理(有其獨立 的shared pool latch)。Oracle的BUG或者說是內存分配策略缺陷導致某一類shared pool的內存分配只在一個子池(subpool)中,即多個子池的使用極不均衡,導致向那個使用得最多的子池申請內存時失敗。報錯信息中的"sga heap(3,0)"即指明是在第3個子池申請內存時失敗。本文案例中的ORA-04031錯誤其產生的原因可以歸結爲Oracle對shared pool的分配/使用策略問題。
- 操作系統內存不足,這隻會出現在shared pool的使用還沒有達到最大值時纔會出現,並且在操作系統都有swap的情況下,只有部分操作系統纔可能有這種情況,比如在HP-UX下,reserved 內存過多導致swap滿。
- 其他原因,多數是因爲BUG。請參考下面提及的MOS參考文檔。
本文中的案例,其數據庫是運行在AIX 5.3系統中的10.2.0.4 RAC,RAC節點數爲2。數據庫是從9i升級到10g,而目前處於正式升級前的測試階段。數據庫報的ORA-04031錯誤信息如本文前面所示(其中的數據庫名稱已經做了處理)。
在繼續講解案例之前,不得不提到MOS上的幾篇關於ORA-04031錯誤的文檔:
- Master Note for Diagnosing ORA-4031 [ID 1088239.1]
- Diagnosing and Resolving Error ORA-04031 on the Shared Pool or Other Memory Pools [Video] [ID 146599.1]
- Interpreting the automatically generated ORA-4031 diagnostic trace. [ID 809560.1]
- Troubleshooting and Diagnosing ORA-4031 Error [Video] [ID 396940.1]
- ORA-4031 Common Analysis/Diagnostic Scripts [Video] [ID 430473.1]
其實分析ORA-04031錯誤,通常有以下幾個要點:
- 判斷錯誤發生所有的內存區域,是shared pool,large pool還是streams pool等。這個很容易從錯誤信息中判斷出來,本文主要描述shared pool的ORA-04031錯誤,這也是最常見的。
- 檢查Shared Pool的總大小以及free memory的大小。如果free memory看上去挺多,以subpool爲單位檢查是否存在是由於碎片導致沒有足夠的連續內存以供分配,特別是關注報錯信息中提及的子池。
- 如果Shared Pool相較於系統規模來說足夠大(通常數GB都已經是很大的了),檢查Shared Pool中有沒有佔用非常多的內存類型或內存組件,如果有,是什麼樣的類型的內存,在各個子池之間是否分佈均勻。如果有異常佔用較多的內存類型,根據此類型在MOS上搜尋是否是會有相應的BUG引起,或者分析這種類型的內存消耗較多的原因。比如如果是sql area很大,檢查是不是硬解析特別多,或者是不是child cursor特別多引起。
- 基於以上分析的數據,來判斷shared pool內存分配失敗的原因。
上面的步驟寫得比較粗略,關於分析和解決ORA-04031問題,這裏也有一篇不錯的文章:Simplified Approach to Resolve ORA-4031
這裏關鍵的是分析Shared Pool的內存數據。ORA-04031錯誤發生後如果有條件可以馬上連接到數據庫中查詢相應的x$表和v$視圖得到相應的數據,否則只能通過ORA-4031錯誤發生時產生的trace文件。_4031_dump_bitvec這個隱含參數用於控制發生ORA-04031錯誤時對SGA的dump行爲,而trace文件的分析就不像使用SQL那樣簡單了。
下面再來詳細地分析案例:
從錯誤信息來看,很顯然,是向shared pool的第3個subpool申請內存時出錯。
以下的數據是shared pool的數據:
- SQL> select sum(bytes)/1024/1024 mb from v$sgastat where pool='shared pool';
- MB
- ----------
- 4096.53062
- SQL> SELECT KSMCHCLS CLASS, COUNT(KSMCHCLS) NUM, SUM(KSMCHSIZ)
- 2 SIZ,
- 3 To_char( ((SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024)),'999,999.00')|
- 4 |'k' "AVG SIZE"
- 5 FROM X$KSMSP GROUP BY KSMCHCLS;
- CLASS NUM SIZ AVG SIZE
- -------- ---------- ---------- ------------
- R-freea 512 24576 .05k
- freeabl 807395 1643969848 1.99k
- recr 530728 662065240 1.22k
- R-free 256 214910976 819.82k
- free 43063 100605496 2.28k
- perm 140 1673368632 11,672.49k
雖然free的數量不是太多,但是freeable的數量還是很多的。
下面是各個子池更詳細的數據:
- SQL> SELECT KSMCHIDX,KSMCHDUR, KSMCHCLS CLASS, COUNT(KSMCHCLS) NUM, SUM(KSMCHSIZ)
- 2 SIZ,
- 3 To_char( ((SUM(KSMCHSIZ)/COUNT(KSMCHCLS)/1024)),'999,999.00')|
- 4 |'k' "AVG SIZE"
- 5 FROM X$KSMSP GROUP BY KSMCHIDX,KSMCHDUR, KSMCHCLS
- 6 order by 1,2,3;
- KSMCHIDX KSMCHDUR CLASS NUM SIZ AVG SIZE
- ---------- ---------- -------- ---------- ---------- ------------
- 1 1 R-free 27 22666392 819.82k
- R-freea 54 2592 .05k
- free 26 14024 .53k
- perm 32 430299448 13,131.70k
- 2 R-free 12 10073952 819.82k
- R-freea 24 1152 .05k
- free 10531 24519112 2.27k
- freeabl 44922 32457736 .71k
- recr 163177 134273584 .80k
- 3 R-free 9 7555464 819.82k
- R-freea 18 864 .05k
- free 1678 4555704 2.65k
- freeabl 79815 102514024 1.25k
- recr 32689 36368096 1.09k
- 4 R-free 20 16789920 819.82k
- R-freea 40 1920 .05k
- free 2182 5810056 2.60k
- freeabl 66235 254656184 3.75k
- recr 16245 58284480 3.50k
- 2 1 R-free 25 20987400 819.82k
- R-freea 50 2400 .05k
- free 23 20016 .85k
- perm 35 398418384 11,116.58k
- 2 R-free 4 3357984 819.82k
- R-freea 8 384 .05k
- free 5137 6604176 1.26k
- freeabl 40377 12140944 .29k
- recr 54942 45005024 .80k
- 3 R-free 9 7555464 819.82k
- R-freea 18 864 .05k
- free 1477 5524568 3.65k
- freeabl 79548 101879808 1.25k
- recr 32380 36033448 1.09k
- 4 R-free 21 17629416 819.82k
- R-freea 42 2016 .05k
- free 2540 7092424 2.73k
- freeabl 70133 270332800 3.76k
- recr 15924 57263032 3.51k
- 3 1 R-free 26 21826896 819.82k
- R-freea 52 2496 .05k
- free 26 20520 .77k
- perm 33 414355416 12,261.94k
- 2 R-free 4 3357984 819.82k
- R-freea 8 384 .05k
- free 4693 7053032 1.47k
- freeabl 49723 14339800 .28k
- recr 52771 42357312 .78k
- 3 R-free 11 9234456 819.82k
- R-freea 22 1056 .05k
- free 3594 9280904 2.52k
- freeabl 95823 121934488 1.24k
- recr 39643 44097504 1.09k
- 4 R-free 25 20987400 819.82k
- R-freea 50 2400 .05k
- free 2822 7291680 2.52k
- freeabl 84443 323149712 3.74k
- recr 19148 67997008 3.47k
- 4 1 R-free 27 22666392 819.82k
- R-freea 54 2592 .05k
- free 26 18088 .68k
- perm 40 430295384 10,505.26k
- 2 R-free 6 5036976 819.82k
- R-freea 12 576 .05k
- free 4818 11473920 2.33k
- freeabl 46089 39963224 .85k
- recr 54061 44188072 .80k
- 3 R-free 9 7555464 819.82k
- R-freea 18 864 .05k
- free 1427 4109504 2.81k
- freeabl 80026 102379080 1.25k
- recr 33217 36949240 1.09k
- 4 R-free 21 17629416 819.82k
- R-freea 42 2016 .05k
- free 1993 6228464 3.05k
- freeabl 70581 269087136 3.72k
- recr 16614 59372656 3.49k
KSMCHDUR是什麼意思?在9i裏面這個列的值通常爲1。實際上,Oracle從9i開始,將shared pool劃分爲多個sub pool。而在10g以上的版本中(具體開始的版本號已經不記得),每個sub pool又分了4個更小的池,我們暫且稱之爲mini heap。每個mini heap有其自己的free list。KSMCHDUR這一列即表示mini heap的編號。"heap(3,0)"中的0是指第1個mini heap。
在上面的數據中,可以看到這個子池的第1個mini heap的free已經很少,只有10來K。另外,我們可以觀察到,perm類型的內存塊只存在於每個sub pool的第1個min heap中。這個是一個重點,在後面會有解釋。
這裏本應該有通過查詢v$sgastat得到shared pool的各個組件佔用的內存分佈,只是寫BLOG時找不到了....但是我們可以在trace文件中找到數據,下面只列出sub pool 3的數據:
- ==============================
- Memory Utilization of Subpool 3
- ================================
- Allocation Name Size
- _________________________ __________
- "free memory " 81466568
- "miscellaneous " 0
- "dpslut_kfdsg " 512
- "trace buffer " 737280
- "trace_knlasg " 504
- "gcs res hash bucket " 1048576
- "gcs res latch table " 12288
- "evaluation con " 0
- "sql area " 344545752
- "UNDO STAT INFO " 59904
- "txncallback " 141744
- "transaction " 2103264
- "ges resource pools " 3968
- "sessions " 4526488
- "dlo fib struct " 128032
- "KJCTS process batching st" 240
- "row cache " 3272
- "KCB where statistics arra" 25888
- "KCB buffer wait statistic" 32000
- "KCB incremental ckpt entr" 512
- "invalid low rba queue " 1024
- "table definiti " 108704
- "temporary tabl " 4136
- "KCL instance cache transf" 131072
- "resumable " 2720
- "KESTB existence bitvec se" 128
- "type object de " 392448
- "enqueue_hash " 318960
- "KSXR pending consumption " 20192
- "KTI SGA freeable small po" 0
- "trigger defini " 885472
- "trigger source " 99264
- "trigger inform " 960
- "KTCN: Obj Invalidation Se" 2336
- "kmgsb circular statistics" 108800
- "kgl lock hash table state" 45360
- "kglsim size of pinned mem" 8024
- "kelr system metrics table" 280
- "kzctxgjsi ksuseclid memor" 117360
- "kzctxgjci ksuseclid memo" 0
- "CCursor " 95912048
- "ksr message pool free que" 188960
- "ksb ci process list (each" 144
- "ksunfy: nodes of hierarch" 320
- "ksuloi: long op free list" 256
- "kwqmncal: allocate buffer" 4048
- "ksim group query request " 0
- "ksuxds ksuseclid memory " 0
- "call " 87304
- "dictionary cache " 0
- "KSXR pending reply queue " 255488
- "hng: All sessions data fo" 0
- "ksfv subheap descriptor " 184
- "gcs resources " 169082312
- "gcs affinity " 8320
- "gcs opaque in " 12312
- "PCursor " 50743128
- "ges resource " 539376
- "fdrec_kffsg " 24
- "work area tab " 80640
- "kglsim main lru count " 38400
- "plwpil:wa " 4264
- "grptab_kfgsg " 2464
- "AW SGA " 40
- "KEWS sesstat seg tbl " 8
- "kebm slave descriptors " 1456
- "kglsim hash table bkts " 1048576
- "KSXR global channels " 1288
- "ges enqueues " 17333720
- "PLS chunk " 352
- "KSQ event description " 1440
- "KESTB existence bitvec " 4096
- "gcs shadows " 101246344
- "qmtb_init_data " 224
- "Core dump directory " 264
- "sort segment handle " 7480
- "SERVICE NAME ENTRY " 48
- "PQ/BizCard " 1536
- "qtree_kwqbspse " 40
- "latch descriptor table " 1576
- "recovery domain " 29856
- "parameters " 30056
- "SHARED SERVERS INFO " 240
- "qtree_kwqbsgn " 40
- "post agent " 0
- "pspool_kfsg " 80
- "plwsppwp:wa " 0
- "PL/SQL DIANA " 14050624
- "segmented arrays " 2072
- "Checkpoint queue " 4097024
- "sim lru segments " 2560
- "sim segment hits " 2560
- "sim state object " 40
- "partitioning d " 199616
- "ASH buffers " 8388608
- "message pool freequeue " 276336
- "PL/SQL MPCODE " 4499360
- "PL/SQL PPCODE " 3984944
- "procs: ksunfy " 1512000
- "primem_kfmdsg " 1032
- "SYSTEM PARAMETERS " 76624
- "object queue hash buckets" 262656
- "object queue hash table d" 7552
- "object level stats hash t" 512
- "object stat dummy statprv" 144
- "sim cache sizes " 320
- "logout storm management " 24000
- "pl/sql source " 21256
- "sys event stats " 199136
- "parameter handle " 67896
- "Parameter Handle " 1656
- "channel handle " 828672
- "API blockers array " 64
- "PARAMETER TABLE " 2048
- "PARAMETER ENTRY " 8
- "LGWR post requested array" 24
- "bloom filter " 3104
- "param hash values " 5984
- "sql area:PLSQL " 11477776
- "PX subheap desc " 256
- "repository " 213544
- "sql area:KOKA " 16192
- "archive_lag_target " 9624
- "state objects " 640
- "latch nowait fails or sle" 116832
- "sched job slv " 5952
- "pso tbs: ksunfy " 390000
- "dummy " 269928
- "Sort Segment " 37440
- "Cursor Stats " 6095760
- "Banner Storage " 2048
- "quiescing session " 3872
- "API data buffer " 16
- "buffer handles " 1020000
- "prmtzdini tz region " 408320
- "sga node map " 16
- "savepoints " 0
- "Managed Standby Proc Arra" 24576
- "OS proc request holder " 4664
- "db_files " 416576
- "PX server msg stats " 2288
- "KQR M PO " 283376
- "kks stats " 40
- "parameter table block " 483168
- "KSFV SGA " 824
- "plugin datafile array " 36016
- "plwda:PLW_STR_NEW_RVAL " 24
- "plwspv:PLW_STR_NEW_VAL " 16
- "KGKP sga " 32
- "BRANCH TABLE SEGMENTED AR" 70176
- "mvobj part des " 306544
- "parameter value memory " 216
- "multiblock re " 98496
- "parameter text value " 1080
- "parallel_max_servers " 8192
- "KGLS heap " 13290800
- "KGSKI sga " 80
- "resize request state obje" 368000
- "MTTR advisory " 1462832
- "monitoring co " 12480
- "rules engine aggregate st" 1416
- "krbmror " 36400
- "joxs heap " 136
- "krbmrsr " 152
- "ksfqpar " 4008
- "SGA - SWRF DrvMet Runtime" 2656
- "SGA - SWRF Metrics ksuTim" 72
- "SGA - SWRF RawMet Runtime" 1408
- "SGA - SWRF Metrics WCTime" 32
- "SQL Memory Manager Base W" 13400
- "change notification regis" 4096
- "simulator latch/bucket st" 59392
- "Prefetch history buffer " 2832
- "change notification obj m" 4096
- "KQR ENQ " 16512
- "kksss " 16464
- "API data buffer length " 0
- "kokcd " 0
- "kohsg " 8
- "Sequence Background Insta" 88
- "ksfqpn " 416
- "KGLS SP " 4704
- "knstsg " 48
- "latch classes " 352
- "system default language h" 568
- "name-service entry " 2592
- "API data buffer array " 0
- "kzull " 4096
- "kzulu " 392
- "kfgsga " 104
- "library cache " 46604712
- "kcrrny " 25320
- "spfile cleanup structure " 16760
- "xssinfo " 5952
- "buffer_pool_desc_array " 3384
- "row cache child latch " 3360
- "rm request queue link " 5320
- "SCHEDULING POLICY TABLE " 160
- "namhsh_kfdsg " 4104
- "Closed Thread SCN Bitvec " 8448
- "Client ID trace settings " 3872
- "osp allocation " 21104
- "os statistics " 9192
- "plwppwp:PLW_STR_NEW_LEN_V" 16
- "plwgc: plwgc_garbage_clea" 0
- "plwiiw: kglpql warnings " 0
- "object queue " 808080
- "obj stat memo " 599184
- "obj htab chun " 122960
- "object level " 111888
- "XCT XGA " 0
- "SGA - SWRF Metric Eidbuf " 900840
- "Processor group descripto" 64
- "Prefetch client count per" 32
- "X$SKGXPIA " 2680
- "simulator hash buckets " 2101248
- "State object subpools " 896
- "API data buffer length ma" 0
- "AWR Table Info (KEW layer" 872
- "character set memory " 4856
- "sim segment num bufs " 1280
- "character set object " 129728
- "session idle latches " 2560
- "qesmmaInitialize: " 112
- "returns from remote ops " 49152
- "name-service " 4080
- "SGA - SWRF Metric CHBs " 10912
- "listener addresses " 32
- "db_block_hash_buckets " 67108864
- "KSI resource types " 2704
- "kglsim object batch " 4196304
- "trigger condition node " 72
- "ksws service events " 18560
- "Heap0: KGL " 11642128
- "fixed allocation callback" 392
- "kqlpWrntoStr:value " 0
- "KEWS statistic name " 424
- "KEWS statistic maps " 1096
- "KCL partition table " 131072
- "kebm slave message " 88
- "kcbl state objects " 12800
- "free rm request queue lin" 0
- "xsoqsehift " 3104
- "DBWR event stats array " 192
- "kgllk hash table " 659456
- "event descriptor table " 192
- "kpssnfy: kpsssgct " 32
- "kpscad: kpscscon " 1952
- "dbwriter coalesce buffer " 3158016
- "kglsim hash table " 8208
- "gcs resource freelist dyn" 256
- "gcs shadow locks dyn seg " 256
- "kks stats latch " 160
- "KTC latch cleanup " 576
- "ges enqueue max. usage pe" 64
- "ges lmd process descripto" 2760
- "KTU latch cleanup " 2496
- "kscdnfyinithead " 16
- "X$KSVIT table " 512
- "kqlpaac:value-1 " 64
- "KCL buffer header " 192064
- "kxfpdp pointers " 28800
- "kodosgi kopfdo " 104
- "kglsim latches " 136
- "TXN TABLE SEGMENTED ARRAY" 54784
- "KJCT remote i " 1640
- "KKJ SHRD WRQS " 288
- "KJC dest ctx " 3560
- "kwrsnfy: kwrs " 1624
- "kwqmn:tskdata " 0
- "KKKI consumer " 4136
- "dbwr suspend/resume ptr a" 16
- "dbwr actual working sets " 64
- "KGSKI schedule " 0
- "temp lob duration state o" 3296
- "ges regular msg buffers " 3078008
- "jsksncb: 9 " 28672
- "Transportable DB Converte" 2552
- "KTU lat struct " 800
- "kks stats hds " 256
- "KSFD SGA I/O b " 4190248
- "HTTP fixed headers " 72
- "UNDO INFO SEGMENTED ARRAY" 649856
- "ges process hash table " 132000
- "jsksncb-latch " 1280
- "kfkid hrec " 24
- "KTCCC OBJECT " 0
- "KTPR HIST TB " 2808
- "KTF MAPPINGS " 12288
- "kksss-heap " 35136
- "kglsim heap " 3431232
- "event statistics per sess" 7665280
- "eventlist to post commits" 16
從上面的數據可以看到,第3個sub pool中,佔用較多的內存是gcs resources、gcs shadows以及sql area。但是沒有明顯的異常。
下面是第3個sub pool中第1個mini-heap中free memory的更詳細數據:
- SQL> break on ksmchidx on ksmchdur
- SQL> select
- 2 ksmchidx,ksmchdur,
- 3 case
- 4 when ksmchsiz < 1672 then trunc((ksmchsiz-32)/8)
- 5 when ksmchsiz < 4120 then trunc((ksmchsiz+7928)/48)
- 6 when ksmchsiz < 8216 then 250
- 7 when ksmchsiz < 16408 then 251
- 8 when ksmchsiz < 32792 then 252
- 9 when ksmchsiz < 65560 then 253
- 10 when ksmchsiz >= 65560 then 253
- 11 end bucket,
- 12 sum(ksmchsiz) free_space,
- 13 count(*) free_chunks,
- 14 trunc(avg(ksmchsiz)) average_size,
- 15 max(ksmchsiz) biggest
- 16 from
- 17 sys.x$ksmsp
- 18 where
- 19 inst_id = userenv('Instance') and
- 20 ksmchcls = 'free'
- 21 group by
- 22 case
- 23 when ksmchsiz < 1672 then trunc((ksmchsiz-32)/8)
- 24 when ksmchsiz < 4120 then trunc((ksmchsiz+7928)/48)
- 25 when ksmchsiz < 8216 then 250
- 26 when ksmchsiz < 16408 then 251
- 27 when ksmchsiz < 32792 then 252
- 28 when ksmchsiz < 65560 then 253
- 29 when ksmchsiz >= 65560 then 253
- 30 end ,
- 31 ksmchidx, ksmchdur
- 32 order by ksmchidx , ksmchdur
- 33 /
- KSMCHIDX KSMCHDUR BUCKET FREE_SPACE FREE_CHUNKS AVERAGE_SIZE BIGGEST
- ---------- ---------- ---------- ---------- ----------- ------------ ----------
- 3 1 5 72 1 72 72
- 13 136 1 136 136
- 27 248 1 248 248
- 48 416 1 416 416
- 56 1920 4 480 480
- 66 1680 3 560 560
- 68 4608 8 576 576
- 164 1344 1 1344 1344
- 180 1472 1 1472 1472
- 188 1536 1 1536 1536
- 190 1552 1 1552 1552
- 199 1624 1 1624 1624
- 204 1880 1 1880 1880
- 207 2032 1 2032 2032
可以看到,最大的free memory塊才2032字節,而報錯中提到的申請的內存大小爲4128字節。由於在第3個sub pool的第1個mini heap中沒有4128字節的連續free memory,所以導致內存申請失敗。
那麼這裏的問題是,爲什麼這個mini heap中的free memory那麼少?正如前面提及,爲什麼這個mini heap中的已經使用的類型全是perm類型?這個問題的答案就在於"DURATION"。Oracle在啓用了SGA自動管理的模式下,爲了便於在shared pool與buffer cache或其他內存之間動態調整大小,規定了在每一個mini heap中分配內存按照duration來進行。這裏duration可以理解爲內存塊的持久時間。perm類型的內存塊,就是分配後不能釋放,只能用於相同組件的重用。比如gcs resources這種組件的內存是perm類型,這種內存被分配後,不能釋放給sql area使用,也不能給gcs shadows使用,只能給其他的gcs resource使用。按DURATION分配內存時,perm類型的內存就只能從每個sub pool的第1個mini heap中分配。而其他類型的內存通常在sub pool的第2-4個mini heap中分配。由於perm類型的內存不能釋放,也不能被其他組件的內存重用,所以裏面的內存會越用越少,如果沒有了free memory怎麼辦?前面說到,這種模式主要是工作在SGA自動管理模式下,如果free memory沒有了,就會從SGA中的其他部分,比如buffer cache中取得memory chunk,加入到缺少內存的mini heap中。正常情況下這種機制沒有問題。
完全使用SGA自動管理有一個缺陷就是,如果應用系統綁定變量做得不好,或者由於BUG,child cursor過多,導致shared pool會變得很大,甚至超過10G,嚴重的比buffer cache還大,另一方面,在buffer cache和shared pool之間頻繁地調整大小,可能會導致嚴重的解析問題和其他性能問題。針對這個問題,通常有2種解決辦法:一種就是關閉SGA自動管理,即將SGA_TARGET設置爲0,以9i的方式來設置shared_pool_size,db_cache_size這些參數,來手動管理SGA;第二種就是sga_target仍然大於0,即自動管理SGA,但是通過設置shared_pool_size,db_cache_size等參數限制這些內存組件的最小大小,而只留給系統極少的自動調整空間。
而出現問題的這套系統,正是使用了第二種方式,開啓了SGA自動調整,但是留給自動調整的空間極少。SGA_TARGET爲35G,buffer_cache_size爲30G,shared_pool_size爲4G,再加上large_pool等組件,幾乎沒有什麼可自動調整的餘地。這種方式下,就存在了問題。下面來做一個按時間的分析:
- 時間T1,數據庫啓動,shared pool只消耗了極少量的內存。
- 時間T2至時間T3,Oracle進程請求shared pool內存,Oracle會向操作系統以指定的粒度爲單位(比如16MB)請求物理內存,加入到所請求內存所在的mini heap中。直至shared pool的大小達到shared pool最大容許的大小。這個容許大小由各參數計算而來。比如說SGA_TARGET爲10G,其他組件的參數設置後最小值爲8G,shared_pool_size的值爲1G,但是shared pool的最大容許大小爲2G。這個時候,每個sub pool的mini heap的大小已經固定。在到達shared pool最大容許大小這一階段,可能會從buffer cache等組件中佔用。
- 時間T4,Oracle進程請求shared pool內存,這個時候只能從free list或age out內存塊後獲取內存,對於sub pool的第1個mini heap,只能從free list中獲取,因爲這個mini heap中的已用內存全是perm,是不能age out的。
- 時間T5,Oracle進程請求shared sub pool中第1個mini heap的內存,但是free list中已經沒有內存。所以報ORA-04031錯誤。
在上面的時間點T5那裏,如果SGA有較大的自動調整空間,比如說完全沒有限制,即buffer_cache_size等參數很少或爲0,這樣在請求第1個mini heap中的內存時,完全可以從buffer cache中佔用,這樣的後果是使shared pool越來越大。
而本文案例的ORA-04031,正是由於SGA自動管理,而自動調整的餘地又太小,最終使sub pool的第1個mini heap空間用光。當然我們可以分析爲什麼會用光,這個就顯得更爲複雜,這跟數據量、應用系統都有很大的關係。而系統中第1次出現ORA-04031錯誤的進程,是一個job進程,而此後大部分出現的錯誤均是job進程,能檢查job代碼,發現在做大量的表的大量數據的UPDATE操作,這可能是引起gcs shadows和gcs resources大量內存使用的原因。在一套RAC數據庫中,gcs和ges相關的perm內存佔用可能會比較大。
那麼除了調整應用,應該怎麼樣解決這樣問題?這裏的解決方法是增加shared_pool_size參數到6G,同時將sga_target設置爲0,再重啓。
而另一種可能的辦法是將參數“_enable_shared_pool_durations"設置爲FALSE。這一參數爲FALSE,將會使shared pool內存分配時,不再使某一類型的內存(比如perm)必須要求在一個固定的mini heap中。而實際上,sga_target設置爲0之後,這一個參數自動會設爲FALSE(由於這一參數是靜態參數,所以修改了sga_target之後需要重啓纔會使這個隱含參數改變),所以建議的解決辦法是設置sga_target參數,而不建議修改隱含參數。當然還有一種辦法是完全讓Oracle自動管理SGA,將buffer_cache_size和shared_pool_size等參數設置爲0,但是正如前面所說,這種方法有比較大的缺陷。