2011-09-12

Direct I/O for Oracle Databases

Pros and Cons of Using Direct I/O for Databases [ID 1005087.1]


Applies to:

Solaris x64/x86 Operating System - Version: 8 6/00 U1 and later   [Release: 8.0 and later ]
Solaris SPARC Operating System - Version: 8.0 and later    [Release: 8.0 and later]
Oracle Solaris Express - Version: 2010.11 and later    [Release: 11.0 and later]
All Platforms

Goal

By default, filesystem I/O is buffered in the kernel through both segmap and the page cache. Depending on the workload, this may not be the optimal configuration for database data, so filesystem buffering and direct I/O should be considered carefully when setting up a system to run a database.

Solution

Drawbacks of using buffered filesystems:
  • There is no limit on the amount of memory used by the filesystem cache; it consumes any free memory available. Any modified (dirty) page in the filesystem cache has to be written out to stable storage. A busy workload that dirties pages quickly may have to compete with the kernel for system resources, since writing dirty pages to stable storage is a kernel operation that can be CPU- and I/O-intensive and may be disruptive to the workload. fsflush, which is responsible for examining a portion of memory and writing modified pages to stable storage, can monopolize CPU resources on large-memory configurations. Pages are written if they are modified and do not fall into one of the following categories:
    • kernel pages
    • free pages
    • locked pages
    • pages associated with a swap device, such as tmpfs pages
    • currently involved in an I/O operation
  • NOTE: mmap()ed pages with write permission that have been modified are also flushed. For details on how the page cache is managed by Solaris, see Document 1003383.1, Understanding Cyclic Caching and Page Cache on Solaris 8 and Above.
  • Segmap, used by the filesystem for non-mapped I/O, i.e., read() and write(), may not scale well for an application with a large working set due to locking contention, as described in bug 4715927 (fixed in Solaris 10). For filesystem read()/write(), Solaris caches information about pages in two places: the page cache and the kernel's segmap. When using read() or write(), there is no user address mapping of the page, so the kernel builds its own virtual address mapping to the page in segmap in order to read or write the contents of the page. When the page is not currently mapped in segmap, the kernel must build the segment mapping on the fly - that means extra CPU cycles, a TLB shootdown (higher cross calls) of the previous usage of the address in segmap, and mutex contention on the kernel segmap structures. That slows down the application's read() and write() operations.

    NOTE: For memory-mapped I/O (i.e. mmap()), though segmap is not used, the application can still dirty pages in the page cache - even more quickly, since it can write to any page of the file without direct kernel intervention such as a system call. Such processes can dirty as much memory as they want. If no msync() calls are issued, it is entirely up to fsflush to write the dirty pages, and this can potentially bring the system to a low-memory condition. (A minimal msync() sketch appears after this list.)
  • The sync() system call causes all dirty pages in the page cache to be scheduled for I/O, subject to queuing limitations enforced by the maxpgio tunable. In large memory configurations, sync() needs to traverse more pages looking for candidates for writing. It takes longer if more memory is configured, even if the resulting I/O performed is not much larger. Calling sync() in an application is a bad idea since it causes all dirty pages to be written, not just the application's modified data. Applications should instead use msync(), memcntl(), fsync(), fdatasync() or fflush(), which are lower impact and flush only the data the application is interested in writing instead of all dirty pages in the system. Another option is to use one of the open() flags, such as O_DSYNC, which specifies that write operations to the file are synchronous and require no subsequent flushing. (A sketch of these alternatives also follows this list.)
  • The POSIX-required single writer lock on files in a filesystem can become an issue when multiple processes are trying to update a file. Multiple readers can execute in parallel; however, when a write is being issued to a particular file, other I/O operations (whether reads or writes) are delayed until that write completes.
  • The filesystem caches pages for write, but will also cache pages for reads. When a full filesystem block is written for the first time, no in-core pages are created and no segmap slots are reserved. It is when the allocated block is read back in memory that it is cached and a segmap entry is created.
    NOTE: In the case of the Oracle redo log, a write operation that is not on a clean 8KB boundary and whose data isn't pre-cached on the storage array (a very common situation) will cause the system to read the block in from storage before it can be written out. This is a side effect of the O_DSYNC flag used by Oracle.
  • Filesystems use main memory, called the page cache, to cache filesystem pages. I/O is done using 8KB pages, so copying large files to/from the filesystem or reading large database tables into memory causes memory to be fragmented into 8KB pages. High filesystem paging activity can potentially exhaust all large pages in the system. Applications requesting larger page sizes may have to wait longer to allocate a larger page, since smaller sized pages may need to be coalesced into a large page. In some cases, the request for a large page may not be fulfilled at all, and the application's request for memory is satisfied with smaller pages. This may cause more frequent TLB misses and poorer use of the TLB. For more detail see Document 1003483.1 Oracle database restart takes longer on high end systems running Solaris.
  • When the application block size is greater than the filesystem block size, it causes unnecessary overhead per application I/O request, since a single I/O request at the application layer must be broken down into multiple filesystem I/O requests.
  • All Oracle writes are synchronous because Oracle opens all data files with the O_DSYNC flag. That means a write must be committed to stable storage before it is counted as complete. On a buffered filesystem, each Oracle write involves an extra copy through the page cache, which tends to slow down Oracle writes.
  • When an application does not set the O_SYNC or O_DSYNC flag on a file descriptor, it writes to the page cache. The filesystem and fsflush together write these data blocks to stable storage at some later time. Data integrity is compromised if the system crashes in the middle of writing out dirty pages. Writes to a UFS filesystem are deferred until a clustersize of data (based on the tunefs maxcontig parameter; the default is 1 MB, or 128 filesystem blocks) has been written by the application. UFS does not write partially filled clusters and relies on fsflush to write those to stable storage. When a new cluster is initiated, an I/O is issued for the previously written data; if a write() breaks the sequentiality (e.g. a write() comes after an lseek()), UFS will initiate the I/O for all previously written data and start a new cluster with the data from the current write(). The difference between sequential and random I/O is that a sequential access pattern allows the amount of deferred I/O to grow to its maximum size. UFS keeps a single slot of deferred I/O. All file data blocks are, however, guaranteed to be flushed to stable storage on a close() of the file descriptor. Metadata is not flushed immediately unless you open the file with the O_SYNC flag. Oracle does I/O with the O_DSYNC flag and ensures all filesystem blocks are allocated prior to the operation (except for the temp tablespaces). This way, even if the metadata is not flushed, no harm is done to the file in the event of a system crash since the file structure is still intact.
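For the mmap() note above, here is a minimal sketch of flushing dirty mapped pages explicitly with msync() instead of leaving all of the work to fsflush; the file name and size are illustrative only.

/*
 * Minimal sketch: dirty pages created through a writable mapping are
 * written back on the application's schedule with msync(), rather than
 * by fsflush.  File name and size are illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    size_t len = 8192;
    int fd = open("/tmp/mapped.dat", O_RDWR | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return (1); }
    if (ftruncate(fd, len) == -1) { perror("ftruncate"); return (1); }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return (1); }

    memset(p, 'y', len);            /* dirties the mapped pages */

    /* Write the dirty pages back synchronously, on our own schedule. */
    if (msync(p, len, MS_SYNC) == -1)
        perror("msync");

    munmap(p, len);
    close(fd);
    return (0);
}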
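Similarly, a minimal sketch of the two lower-impact alternatives to sync() mentioned above: flushing a single file's data with fdatasync(), or opening the file with O_DSYNC so that each write() returns only once the data is on stable storage; again, the file name is illustrative only.

/*
 * Minimal sketch of two low-impact alternatives to sync():
 *   1. buffered writes followed by fdatasync() on just this file, or
 *   2. opening with O_DSYNC so each write() is synchronous.
 * File name is illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[8192];
    memset(buf, 'x', sizeof (buf));

    /* Option 1: buffered writes, then flush only this file's data. */
    int fd = open("/tmp/demo.dat", O_WRONLY | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return (1); }
    if (write(fd, buf, sizeof (buf)) == -1)
        perror("write");
    if (fdatasync(fd) == -1)    /* flushes this file, not every dirty page */
        perror("fdatasync");
    close(fd);

    /* Option 2: O_DSYNC - each write() returns only after the data is stable. */
    fd = open("/tmp/demo.dat", O_WRONLY | O_DSYNC);
    if (fd == -1) { perror("open O_DSYNC"); return (1); }
    if (write(fd, buf, sizeof (buf)) == -1)
        perror("write O_DSYNC");
    close(fd);
    return (0);
}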
Advantages of direct I/O:
Direct I/O is a filesystem feature that can be enabled via the forcedirectio mount option, or within an application for a single file by calling directio() (a sketch follows this list). Direct I/O offers several benefits:
  • Retains a filesystem administration model similar to that of a buffered filesystem.
  • Provides near-device performance with support for zero copy. Zero copy means the ability to DMA from the I/O controller to the user buffer without the overhead of copying through the kernel.
  • UFS direct I/O bypasses, if possible, the filesystem cache. This eliminates the paging interaction with the virtual memory subsystem.
  • Unlike a buffered filesystem, direct I/O does not prefetch data. For example, a 2KB read by an application results in a 2KB I/O instead of an 8KB read, as in the case of a buffered filesystem. This results in relatively less memory use. An example of this is Oracle, which runs better when it manages its memory directly: using the query option in Oracle allows it to do its own prefetching, which is superior to the kernel guessing what should be prefetched.
  • 64-bit Oracle can support a very large SGA (Shared Global Area) size. Data blocks cached in SGA are accessible to all instances of the database. Caching these data blocks in the filesystem cache (double buffering), as in the case of a buffered filesystem, wastes memory and puts unnecessary strain on the Solaris virtual memory subsystem. The overhead of going through the filesystem code for file access (even if the data is in the page cache) is still significant as compared to accessing data in the SGA.
  • I/Os can be any multiple of 512 bytes
  • Avoids the breakup of O_SYNC writes
  • Unlike a buffered filesystem, UFS direct I/O allows concurrent writes to the file by multiple processes. Applications are responsible for providing appropriate file locking for these concurrent writes.
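A minimal sketch of enabling direct I/O from within an application with directio(3C), as mentioned above, rather than mounting the whole filesystem with forcedirectio; the file name is illustrative only.

/*
 * Minimal sketch: enable UFS direct I/O on a single file with
 * directio(3C).  File name is illustrative only.
 */
#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("/u01/oradata/example.dbf", O_RDWR | O_DSYNC);
    if (fd == -1) { perror("open"); return (1); }

    /* Advise the filesystem to bypass the page cache for this file. */
    if (directio(fd, DIRECTIO_ON) == -1)
        perror("directio");

    /* ... reads and writes on fd now go directly to the device ... */

    close(fd);
    return (0);
}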
Limitations of Direct I/O:
There are some drawbacks to direct I/O that should be considered when deciding whether to use it or not:
  • Only available for regular files
  • Files with holes cannot use direct I/O
  • No buffered read-ahead. Backups or applications that rely on read-ahead may suffer a performance hit when using direct I/O.
  • Single-stream reads, typically used for database table scans, may suffer due to the absence of filesystem pre-fetching.
  • Writes must be sector-aligned (see the alignment sketch after this list)
  • File must not be mmap()ed (see bug 4362469 ).
  • Writes cannot extend the allocated blocks by extending the file or filling in holes. This means direct I/O does not work when creating a new file, as in the case of cp. Applications must be aware of this behavior with direct I/O. This limitation does not affect databases like Oracle, because they pre-allocate the data blocks (except for the temp tablespace) prior to writing them.
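A minimal sketch of a write that satisfies the sector-alignment limitation above: the buffer is 512-byte aligned (via memalign(3C)) and the length and offset are multiples of 512 bytes. The file name is illustrative only, and the file is assumed to be pre-allocated since direct I/O writes cannot extend it.

/*
 * Minimal sketch of a direct-I/O-friendly write: aligned buffer,
 * aligned offset, and a length that is a multiple of 512 bytes.
 * File name is illustrative; the file is assumed to be pre-allocated.
 */
#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    size_t len = 64 * 512;              /* multiple of the 512-byte sector size */
    char *buf = memalign(512, len);     /* sector-aligned buffer */
    if (buf == NULL) { perror("memalign"); return (1); }
    memset(buf, 0, len);

    int fd = open("/u01/oradata/example.dbf", O_WRONLY | O_DSYNC);
    if (fd == -1) { perror("open"); return (1); }
    (void) directio(fd, DIRECTIO_ON);

    /* Aligned buffer, offset and length let the write go straight to the device. */
    if (pwrite(fd, buf, len, 0) == -1)
        perror("pwrite");

    close(fd);
    free(buf);
    return (0);
}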
Guidelines and Tips:
  • When using a buffered filesystem with large mmap()ed files that are being modified frequently, consider moving those files to tmpfs storage, such as /tmp, to reduce dirty page write activity. Those files would need to both fit in memory and not be required to survive reboots. This will help reduce bursts of I/O. If that is not possible, then processes that have root privilege can lock down the mmap()ed memory with mlock(3C), which prevents fsflush from issuing I/O (see the sketch after this list). If neither is possible, then increasing autoup and decreasing tune_t_fsflushr will have the effect of spreading out the flushing of data. The trade-off is that the data at risk in a crash is extended to autoup seconds' worth of data.
  • Segmap is a virtual memory address consumer only. It is the kernel address space used to map pages for non-mapped I/O - i.e. read(2) and write(2). Depending on your application working set and access pattern, a smaller sized segmap may mean increased segmap misses and a performance penalty, even though the page is in the cache. If your working set is huge, and your I/O access pattern means you often miss the segmap, then the size of segmap has no effect on performance. In Solaris 8 and above, the size of segmap can be tuned (segmap_percent, which defaults to 12% of total memory) to a larger value as needed. Having a large segmap, in some cases, improves the efficiency of file I/O using the page cache. You can think of segmap as the fast filesystem cache. Access to cached filesystem pages is faster when the page is mapped in segmap and present in the page cache than it is if the page is not mapped in segmap, but in the page cache (that is, still in memory, but not in the segmap segment).

    NOTE: In the Solaris 10 kernel (SPARC only), the segmap driver now leverages segkpm (which provides a kernel virtual address for all physical memory). This means that there is far less overhead in providing a segmap address for a mapping of a filesystem page, and it reduces the cross-calls required to shoot down old mappings for those addresses. Without this enhancement, a cached read(2) is almost 4x slower if it misses segmap. It is not yet available for Solaris x86/x64.
  • UFS has a "better" direct I/O facility relative to VxFS, due to its write concurrency feature discussed earlier. VxFS offers write concurrency with Veritas Quick I/O (QIO) and ODM optional products, which also allows per-file control over buffering.
  • Setting the Oracle tunable filesystemio_options=setall with Oracle on UFS causes Oracle to use direct I/O on all of its files. This has the advantage that it only affects Oracle's use of the files and won't affect the behavior of backup software or applications relying on filesystem read-ahead. The downside of this setting is that it affects all Oracle data files and thus does not provide the tuning possibilities offered by the VERITAS QIO and ODM products, where direct I/O can be set on a per-file basis.
  • The VxFS filesystem supports several options for unbuffered I/O operation. The two most frequently used options are:
    • convosync=direct: This option causes direct I/O to be used on files that the application opens with the O_SYNC or O_DSYNC flags (Oracle uses this flag). Other applications are not affected and continue to use buffered I/O.
    • mincache=direct: All file I/O operations use direct I/O. Setting this option may negatively affect applications such as cp, ufsdump, and cpio due to the absence of read-ahead.
  • Using the asynchronous I/O library (libaio), programs can issue reads and writes via aioread/aiowrite(3AIO) such that control is returned to the calling process without waiting for the read or write to complete. This allows the program to issue I/Os in parallel and wait for their completion only when it wants to, in contrast to fread/fwrite(3C), which perform blocking I/O. (A sketch follows this list.)
    NOTE: When using AIO there is no need to increase the number of db_writer processes. A single db_writer process using AIO is capable of handling thousands of concurrent I/O operations.
  • Oracle tunables for setting direct I/O and asynchronous I/O:
          filesystemio_options=none (Disables asynchronous I/O and direct I/O)
          filesystemio_options=asynch (Enable asynchronous I/O only)
          filesystemio_options=directIO (Enable direct I/O on all Oracle files)
          filesystemio_options=setall (Enable asynchronous I/O and direct I/O on Oracle files)

    NOTE: When enabling direct I/O in Oracle, it is not necessary to also use the forcedirectio filesystem mount option.
  • Setting the Oracle tunable db_file_multiblock_read_count to a value greater than or equal to 32 causes Oracle to use pread() instead of readv(), which can substantially improve database throughput. The pread() function performs the same action as read(), except that it reads from a given position in the file without changing the file pointer. This makes a big difference when multiple threads/processes are performing non-sequential I/O operations, such as lseek() followed by read(), concurrently on the same file descriptor (see the pread() sketch after this list). This applies to both buffered and UFS direct I/O.

    NOTE: Oracle calculates the read size for full table or index scan operations as db_file_multiblock_read_count x db_block_size (default 8KB). This means a setting of 32 for db_file_multiblock_read_count would generate 256KB-sized I/Os. However, some versions of Oracle use 128KB reads by default. Use truss(1) to verify the size Oracle is using for read operations.
  • Key points to remember when setting up an Oracle database:
  1. Use direct I/O for Oracle redo logs.
  2. Configure a large SGA size to avoid unnecessary physical reads. Setting db_block_checksum=TRUE (the default with Oracle 9i) means re-validating the checksum on every read miss, which becomes an issue for Oracle rollback and index I/O performance. A large SGA means better read hits and less overhead associated with checksum validation.
  3. Keep sufficient memory (sort_area_size) for Oracle sort operations. That avoids disk I/O when Oracle performs sorting operations.
  4. Place tables (especially write-intensive tables) on direct I/O-enabled filesystems.
  5. Use buffered filesystems for Oracle archive logs, since pre-fetching at the filesystem level helps sustained sequential reads.
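A minimal sketch of the mlock(3C) tip above: a process with the appropriate privilege locks the mapped region so that fsflush does not issue I/O for its dirty pages; the file name and size are illustrative only.

/*
 * Minimal sketch: lock an mmap()ed region with mlock(3C) so fsflush
 * does not write its dirty pages.  Requires privilege (typically root).
 * File name and size are illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    size_t len = 1024 * 1024;
    int fd = open("/tmp/hot-file.dat", O_RDWR);
    if (fd == -1) { perror("open"); return (1); }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return (1); }

    if (mlock(p, len) == -1)        /* fails without sufficient privilege */
        perror("mlock");

    /* ... modify p[] freely; fsflush will not write these pages ... */

    munlock(p, len);
    munmap(p, len);
    close(fd);
    return (0);
}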
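A minimal sketch of issuing I/O asynchronously, assuming the aioread(3AIO)/aiowait(3AIO) interface mentioned above (link with -laio); the file name, offsets and sizes are illustrative only.

/*
 * Minimal sketch, assuming the Solaris aioread(3AIO)/aiowait(3AIO)
 * interface: several reads are issued without waiting, and their
 * completions are collected later.  Compile with -laio.
 * File name, offsets and sizes are illustrative only.
 */
#include <sys/types.h>
#include <sys/asynch.h>
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NREQ    4
#define BUFSZ   8192

int
main(void)
{
    static char bufs[NREQ][BUFSZ];
    aio_result_t res[NREQ];
    int i;

    int fd = open("/u01/oradata/example.dbf", O_RDONLY);
    if (fd == -1) { perror("open"); return (1); }

    /* Issue several reads without waiting for each one to finish. */
    for (i = 0; i < NREQ; i++) {
        if (aioread(fd, bufs[i], BUFSZ,
            (off_t)i * BUFSZ, SEEK_SET, &res[i]) == -1)
            perror("aioread");
    }

    /* Collect the completions when the program is ready for them. */
    for (i = 0; i < NREQ; i++) {
        aio_result_t *done = aiowait(NULL);
        if (done == NULL || done == (aio_result_t *)-1) {
            perror("aiowait");
            continue;
        }
        if (done->aio_return == -1)
            fprintf(stderr, "read failed: %d\n", done->aio_errno);
    }

    close(fd);
    return (0);
}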
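A minimal sketch of the pread() point above: because pread() carries the offset with the request, concurrent threads or processes can read the same file descriptor without an lseek()+read() sequence racing on the shared file pointer; the file name and offsets are illustrative only.

/*
 * Minimal sketch: pread() supplies the offset with each request, so no
 * lseek() and no shared file-pointer state.  File name and offsets are
 * illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char buf[8192];
    int fd = open("/u01/oradata/example.dbf", O_RDONLY);
    if (fd == -1) { perror("open"); return (1); }

    /* Each request names its own offset. */
    if (pread(fd, buf, sizeof (buf), 256 * 1024) == -1)
        perror("pread");
    if (pread(fd, buf, sizeof (buf), 512 * 1024) == -1)
        perror("pread");

    close(fd);
    return (0);
}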
Benefits of Buffered Filesystems:
  • The page cache contains all file pages that are held in memory. All page-in and page-out operations occur between this cache and the underlying filesystems on disks. The page cache uses the system memory as a cache for all filesystems. This means that written data is acknowledged very quickly (basically at memcpy speed) to applications. Read operations that hit the page cache also operate at memcpy speed.
  • Write coalescing: multiple writes to a file result in fewer, larger physical I/Os. Write cancellation happens when there are multiple writes (from one or more processes) to the same file; the last write cancels the previous ones, leading to fewer disk writes. It is because of the page cache that the I/O size sent to the devices can be bigger than the application's I/O size. This can be a significant benefit of using the page cache for some workloads.
  • The read-ahead feature of filesystems can improve database index scan performance.
  • Sharing memory via the page cache for consolidation purposes is much easier than manually distributing available memory to all the database instances.
  • Oracle DIRECT PATH writes: in some cases, Oracle relies on deferred (non-O_DSYNC) writes to achieve optimum performance, such as the DIRECT PATH write operation. Oracle issues multiple deferred writes, assuming a filesystem page cache is present, then issues an fsync() to commit the changes. These operations are slow without filesystem buffering: UFS direct I/O writes synchronously and thus generates more physical I/O, and with direct I/O the latency of the underlying I/O subsystem can become an issue.

    NOTE: fsync(3) performance was improved in Solaris 9. If the workload is dependent on "DIRECT PATH" writes, consider upgrading to Solaris 9.
  • The best way to do synchronous writes to disk (especially for large writes) is not to open the file with O_DSYNC or O_SYNC, but to call fsync() on the file descriptor at every point at which the data must be on disk (see the sketch after this list). fsync() is much faster in Solaris 10 than it was previously; fsync() on files that have no dirty pages is very fast, unlike in previous releases where its speed was a function of the number of file pages (clean or dirty) in memory.
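A minimal sketch of the pattern described in the last item above: buffered writes without O_SYNC or O_DSYNC, followed by a single fsync() at the point where the data must be on disk; the file name and sizes are illustrative only.

/*
 * Minimal sketch: several buffered writes, then one fsync() commit point
 * instead of a synchronous write per call.  File name and sizes are
 * illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[64 * 1024];
    int i;
    memset(buf, 'z', sizeof (buf));

    int fd = open("/tmp/large-output.dat", O_WRONLY | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return (1); }

    /* Buffered writes: acknowledged at memcpy speed. */
    for (i = 0; i < 16; i++)
        if (write(fd, buf, sizeof (buf)) == -1)
            perror("write");

    /* One commit point: flush this file's dirty pages to stable storage. */
    if (fsync(fd) == -1)
        perror("fsync");

    close(fd);
    return (0);
}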
Monitoring Tools:
  • directiostat: the directiostat command dumps the kstat information for direct I/O calls to UFS filesystems. It is available for download.
    >directiostat
    lreads lwrites preads pwrites Krd Kwr holdrds nflush
    • lreads: logical reads to the UFS via directio
    • lwrites: logical writes to the UFS via directio
    • preads: physical reads to media
    • pwrites: physical writes to media
    • Krd: kilobytes read
    • Kwr: kilobytes written
    • holdrds: number of times the read was a "hole" in the file
    • nflush: number of cached pages flushed
  • Segmap thrashing: buffered filesystems become an issue when the CPU starts spending more time in kernel mode (%sys > 50) and kernel profiling data reports a high occurrence of segmap_fault() and pvn_vplist_dirty() events.
    NOTE: For details on how to debug high system/kernel time see Document 1008930.1 Debugging a High System or Kernel Time in Solaris[TM].
  • Kernel debugging tools such as adb and mdb have utilities (such as the $c command) to dump the kernel stack of active threads.
When the kernel stack of an Oracle thread in pread() shows functions similar to the ones below, Oracle's read() operation is resulting in segmap thrashing. The relevant stack frame is marked with an arrow.
biowait+0x5c(318514c8b98, 318514c8b98, 0, 13, 1480248, 78437200)
vx_biowait+4(318514c8b98, 78437a48, 1, 3003767aad8, 318514c8b98,..)
vx_io_wait+0x14(318514c8b98, 84051, 3003767aad8, 0, 3003767aad8, 40)
vx_io_ext+0x44(3006cd14b10, 3005bc83530, 2cdcd0, 92000, 70035651870, ..)
vx_nalloc_getpage+0x208(2000, 0, 2000, 1, e000, 0)
vx_do_getpage+0x940(0, 1, 0, 1, 2a105aef510, 2a105aef354)
vx_getpage1+0x594(2a105aef510, 3b094000, 2000, 1, 1, 2a105aef43c)
vx_getpage+0x44(30069ee1e70, 3b586000, 2000, 2a105aef52c, 2a105aef510,.)
segmap_fault+0x10c(30003cf5290, 300011cfef8, 3b586000, 2000, 0, 1) <---
segmap_getmapflt+0x408(30001a07f38, 3b596000, 3b586000, 2000, 1, 1570a)
vx_segmap_getmap+0x40(2a105aef7ac, 3b586000, 2000, 1, 2a105aef7ac, 0)
vx_cache_read+0x338(3006cd14b10, 0, 0, 1fff, 0, 2a105aefa20)
vx_read1+0x574(2000, 0, 0, 30065e590f0, 8, 2a105aefa20)
vx_read+0x23c(3006cd14b10, 30065e590f0, 0, 1, 2a105aefa20, 30069ee1e70)
pread+0x28c(3b586000, 30fa823d0a0, 0, 2043, 2000, 3df)
syscall_trap+0x88(3df, 4b1920000, 2000, 3b586000, 2000, 3df)
  • Single Writer Lock: When the kernel stack of an Oracle thread in pwrite() shows stack frames similar to the ones below, write() operations in Oracle are hitting a filesystem single writer lock.
vx_rwsleep_rec_lock+0x5c(301ea2c4ee0, 301ea2c4ed8, 301ea2c4ec0, 4,..0)
vx_recsmp_rangelock+8(30069043e80, 2a1038a79e8, 0, 4, 0, 4)
vx_irwlock+0x28(30069043a68, 4, 0, 2a1038a79e8, 8, 2)<------
vx_write+0x2e4(2a1038a7ab8, f001, 4e1f6000, 3006d567ca0, 1, 0)
pwrite+0x248(3134510baf8, 2043, 1, 4e1f4000, ffffffff7c30da3c, 0)
syscall_trap+0x88(4db, 652128000, 2000, 4e1f4000, ffffffff7c30da3c, 0)
  • Amount of dirty pages: mdb provides information about the number of dirty pages in the system using the ::memstat dcmd. Note that this is only available in Solaris 9 and above.
    % mdb -k
    Loading modules: [ unix krtld genunix ip s1394 usba wrsm ipc random nfs
    ptm logindmux cpc sppp nca wrsmd ]
    > ::memstat
    Page Summary          Pages       MB   %Tot
    ------------       --------   ------   ----
    Kernel                462366     3612     6%
    Anon                  185051     1445     2%
    Exec and libs          11894       92     0%
    Page cache            922689     7208    11%   <--
    Free (cachelist)     1121939     8765    14%
    Free (freelist)      5358354    41862    66%
    Total                8062293    62986
Solaris 8 and below can use the prtmem utility, part of the memtool package, which is available for download.
           prtmem (summary of where the system memory is used)
          Total memory: 180 Megabytes
          Kernel Memory: 30 Megabytes
          Application memory: 82 Megabytes
          Executable memory: 38 Megabytes
          Buffercache memory: 22 Megabytes
          Free memory: 5 Megabytes
  • segmap statistics
    Overall segmap statistics may be obtained with the command kstat -n segmap. To calculate segmap efficiency since boot, use the formula:
(get_reclaim + get_use) / getmap
These counters may be sampled at set intervals via kstat unix:0:segmap:get\* 1. Another interesting measure is the rate of increase (the difference between two successive samples) of:
getmap - (get_reclaim + get_use)
which is the number of pages per second that miss in segmap. This is a way of measuring whether tuning segmap_percent would help with segmap thrashing. If this number is small, no benefit should be expected from increasing segmap_percent in the first place.
We can expect roughly a 20% performance improvement on those pages if, and only if, increasing segmap_percent manages to keep them mapped in segmap. There is no guarantee that increasing the size of segmap will cause the rate of segmap misses to go down; it is workload dependent. For example, if the number increases by 1000 pages per second, that corresponds to 1000 x 8KB pages, or 8MB/sec of I/O, that may improve performance by about 20% if increasing segmap_percent causes the data to stay mapped in segmap. (A small sketch of these calculations follows this list.)
  • The statit utility also reports segmap usage.
  • Oracle STATSPACK: it offers a wealth of Oracle statistics, including wait event statistics that show what Oracle is waiting for.
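A minimal sketch of the segmap calculations from the segmap statistics item above. The counter values are hard-coded for illustration; in practice they would come from two samples of kstat -n segmap (getmap, get_reclaim, get_use) taken a known interval apart.

/*
 * Minimal sketch of the segmap efficiency and miss-rate formulas.
 * The two samples below are illustrative values only; real values come
 * from kstat -n segmap.
 */
#include <stdio.h>

int
main(void)
{
    /* Illustrative counters from two successive kstat samples. */
    unsigned long long getmap1 = 1000000, reclaim1 = 900000, use1 = 50000;
    unsigned long long getmap2 = 1060000, reclaim2 = 940000, use2 = 52000;
    int interval = 10;          /* seconds between the two samples */

    /* Efficiency since boot: (get_reclaim + get_use) / getmap */
    double efficiency = (double)(reclaim2 + use2) / (double)getmap2;

    /* Rate of increase of getmap - (get_reclaim + get_use): pages/sec missing in segmap */
    unsigned long long miss1 = getmap1 - (reclaim1 + use1);
    unsigned long long miss2 = getmap2 - (reclaim2 + use2);
    double miss_rate = (double)(miss2 - miss1) / interval;

    printf("segmap efficiency since boot: %.1f%%\n", efficiency * 100.0);
    printf("segmap misses: %.0f pages/sec (about %.1f MB/sec of 8KB pages)\n",
        miss_rate, miss_rate * 8.0 / 1024.0);
    return (0);
}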

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community, Oracle Solaris Kernel Community.
