<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Berkeley DB recoverability</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="transapp.html" title="Chapter 11. Berkeley DB Transactional Data Store Applications" />
<link rel="prev" href="transapp_filesys.html" title="Recovery and filesystem operations" />
<link rel="next" href="transapp_tune.html" title="Transaction tuning" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 18.1.40</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Berkeley DB recoverability</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
<th width="60%" align="center">Chapter 11. Berkeley DB Transactional Data Store Applications </th>
<td width="20%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="transapp_reclimit"></a>Berkeley DB recoverability</h2>
</div>
</div>
</div>
<p>
Berkeley DB recovery is based on write-ahead logging. This
means that when a change is made to a database page, a
description of the change is written into a log file. This
description in the log file is guaranteed to be written to
stable storage before the database pages that were changed are
written to stable storage. This is the fundamental feature of
the logging system that makes durability and rollback work.
</p>
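<p>
          As an illustration, the following is a minimal sketch of a
          transactionally protected update using the Berkeley DB C API.
          The handles and error handling are abbreviated, and the key
          and data values are hypothetical. When the commit returns,
          the log records describing the change have been forced to
          stable storage, even if the changed database page has not
          yet been written:
        </p>
<pre class="programlisting">#include &lt;db.h>
#include &lt;string.h>

/*
 * A minimal sketch: "dbenv" and "dbp" are assumed to be an already
 * opened transactional environment and database handle.
 */
int
update_record(DB_ENV *dbenv, DB *dbp)
{
    DB_TXN *txn;
    DBT key, data;
    int ret;

    memset(&amp;key, 0, sizeof(key));
    memset(&amp;data, 0, sizeof(data));
    key.data = "fruit";
    key.size = sizeof("fruit");
    data.data = "apple";
    data.size = sizeof("apple");

    if ((ret = dbenv->txn_begin(dbenv, NULL, &amp;txn, 0)) != 0)
        return (ret);

    /* The change is described in the log before the page is changed. */
    if ((ret = dbp->put(dbp, txn, &amp;key, &amp;data, 0)) != 0) {
        (void)txn->abort(txn);
        return (ret);
    }

    /*
     * Commit forces the log records to stable storage; the database
     * page itself may be written lazily, later.
     */
    return (txn->commit(txn, 0));
}</pre>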
<p>
If the application or system crashes, the log is reviewed
during recovery. Any database changes described in the log
that were part of committed transactions and that were never
written to the actual database itself are written to the
database as part of recovery. Any database changes described
in the log that were never committed and that were written to
the actual database itself are backed out of the database as
part of recovery. This design allows the database to be
written lazily, and only blocks from the log file have to be
forced to disk as part of transaction commit.
</p>
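<p>
          For example, normal recovery can be run as part of opening
          the environment by specifying the DB_RECOVER flag. This is a
          minimal sketch; the environment home directory is
          hypothetical:
        </p>
<pre class="programlisting">#include &lt;db.h>

/*
 * A minimal sketch of running normal recovery at application
 * startup; "/var/dbenv" is a hypothetical environment home.
 */
int
open_env_with_recovery(DB_ENV **dbenvp)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&amp;dbenv, 0)) != 0)
        return (ret);

    /*
     * DB_RECOVER reviews the log, redoing committed changes and
     * backing out uncommitted ones, before the open returns.
     */
    if ((ret = dbenv->open(dbenv, "/var/dbenv",
        DB_CREATE | DB_INIT_LOCK | DB_INIT_LOG |
        DB_INIT_MPOOL | DB_INIT_TXN | DB_RECOVER, 0)) != 0) {
        (void)dbenv->close(dbenv, 0);
        return (ret);
    }
    *dbenvp = dbenv;
    return (0);
}</pre>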
<p>
Two interfaces are of concern when considering
Berkeley DB recoverability:
</p>
<div class="orderedlist">
<ol type="1">
<li>
The interface between Berkeley DB and the operating
system/filesystem.
</li>
<li>
The interface between the operating
system/filesystem and the underlying stable storage
hardware.
</li>
</ol>
</div>
<p>
Berkeley DB uses the operating system interfaces and its
underlying filesystem when writing its files. This means that
Berkeley DB can fail if the underlying filesystem fails in
some unrecoverable way. Otherwise, the interface requirements
here are simple: The system call that Berkeley DB uses to
flush data to disk (normally fsync or fdatasync) must
guarantee that all the information necessary for a file's
recoverability has been written to stable storage before it
returns to Berkeley DB, and that no possible application or
system crash can cause that file to be unrecoverable.
</p>
<p>
In addition, Berkeley DB implicitly uses the interface
between the operating system and the underlying hardware. The
interface requirements here are not as simple.
</p>
<p>
First, it is necessary to consider the underlying page size
of the Berkeley DB databases. The Berkeley DB library performs
all database writes using the page size specified by the
application, and Berkeley DB assumes pages are written
atomically. This means that if the operating system performs
filesystem I/O in blocks of a different size than the database
page size, the possibility of database corruption increases.
For example, assume that Berkeley DB is writing
32KB pages for a database, and the operating system does
filesystem I/O in 16KB blocks. If the operating system writes
the first 16KB of the database page successfully, but crashes
before being able to write the second 16KB of the database,
the database has been corrupted and this corruption may or may
not be detected during recovery. For this reason, it may be
important to select database page sizes that will be written
as single block transfers by the underlying operating system.
If you do not select a page size that the underlying operating
system will write as a single block, you may want to configure
the database to use checksums (see the DB_CHKSUM flag to the <a href="../api_reference/C/dbset_flags.html" class="olink">DB->set_flags()</a> method for
more information). By configuring checksums, you guarantee that
this kind of corruption will be detected, at the expense of the
CPU required to generate the checksums. When such an error is
detected, the only course of recovery is to perform
catastrophic recovery to restore the database.
</p>
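<p>
          For example, the following sketch configures a database with
          a page size chosen to match a hypothetical 16KB filesystem
          I/O block size, and enables checksums as a fallback; "dbenv"
          is assumed to be an already opened transactional environment
          and the database name is hypothetical:
        </p>
<pre class="programlisting">#include &lt;db.h>

int
open_db(DB_ENV *dbenv, DB **dbpp)
{
    DB *dbp;
    int ret;

    if ((ret = db_create(&amp;dbp, dbenv, 0)) != 0)
        return (ret);

    /*
     * Match the page size to the (hypothetical) 16KB filesystem
     * I/O block size so pages are written as single block
     * transfers.
     */
    if ((ret = dbp->set_pagesize(dbp, 16 * 1024)) != 0)
        goto err;

    /*
     * Enable checksums so any partial-page write is detected,
     * at some CPU cost.
     */
    if ((ret = dbp->set_flags(dbp, DB_CHKSUM)) != 0)
        goto err;

    if ((ret = dbp->open(dbp, NULL, "mydb.db", NULL,
        DB_BTREE, DB_CREATE | DB_AUTO_COMMIT, 0)) != 0)
        goto err;

    *dbpp = dbp;
    return (0);

err:    (void)dbp->close(dbp, 0);
    return (ret);
}</pre>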
<p>
Second, if you are copying database files (either as part
of doing a hot backup or creation of a hot failover area),
there is an additional question related to the page size of
the Berkeley DB databases. You must copy databases atomically,
in units of the database page size. In other words, the reads
made by the copy program must not be interleaved with writes
by other threads of control, and the copy program must read
the databases in multiples of the underlying database page
size. On Unix systems, this is not a problem, as these
operating systems already make this guarantee and system
utilities normally read in power-of-2 sized chunks, which are
larger than the largest possible Berkeley DB database page
size. Other operating systems, particularly Linux and
Windows, do not provide this guarantee, and hot backups
cannot be performed on those systems by reading data
from the file system. The <a href="../api_reference/C/db_hotbackup.html" class="olink">db_hotbackup</a> utility should be used on
these systems.
</p>
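<p>
          Recent releases also provide a programmatic hot backup
          interface. The following is a minimal sketch, assuming an
          already opened transactional environment and a hypothetical
          backup directory; the library itself reads the databases in
          units of the database page size:
        </p>
<pre class="programlisting">#include &lt;db.h>

/*
 * A minimal sketch of a programmatic hot backup using
 * DB_ENV->backup(); "/var/backup" is a hypothetical target
 * directory, created if it does not exist.
 */
int
hot_backup(DB_ENV *dbenv)
{
    return (dbenv->backup(dbenv, "/var/backup", DB_CREATE));
}</pre>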
<p>
An additional problem we have seen in this area arose in some
releases of Solaris, where the cp utility was implemented using
the mmap system call rather than the read system call. Because
the Solaris mmap system call did not make the same guarantee
of read atomicity as the read system call, using the cp
utility could create corrupted copies of the databases.
Another problem we have seen involves implementations of the
tar utility that do 10KB block reads by default and, even when
an output block size is specified, do not read from the
underlying databases in multiples of that block size.
Using the dd utility instead of the cp or tar utilities (and
specifying an appropriate block size) fixes these problems.
If you plan to use a system utility to copy database files,
you may want to use a system call trace utility (for example,
ktrace or truss) to check for an I/O size smaller than or not
a multiple of the database page size and system calls other
than read.
</p>
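<p>
          To make the requirement concrete, the following sketch copies
          a file by issuing read system calls in a fixed multiple of
          the database page size. The 64KB buffer size is hypothetical
          and must be a multiple of your actual database page size; the
          other requirement, that the reads not be interleaved with
          writes by other threads of control, still applies:
        </p>
<pre class="programlisting">#include &lt;fcntl.h>
#include &lt;unistd.h>

/* Hypothetical read size; must be a multiple of the page size. */
#define BUFSIZE (64 * 1024)

int
copy_db(const char *src, const char *dst)
{
    char buf[BUFSIZE];
    ssize_t nr;
    int in, out;

    if ((in = open(src, O_RDONLY)) &lt; 0)
        return (-1);
    if ((out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600)) &lt; 0) {
        (void)close(in);
        return (-1);
    }

    /*
     * Each read requests a multiple of the database page size;
     * it is the reads, not the writes, that must never see a
     * partially written page.
     */
    while ((nr = read(in, buf, sizeof(buf))) > 0)
        if (write(out, buf, (size_t)nr) != nr) {
            nr = -1;
            break;
        }

    (void)close(in);
    (void)close(out);
    return (nr &lt; 0 ? -1 : 0);
}</pre>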
<p>
Third, it is necessary to consider the behavior of the
system's underlying stable storage hardware. For example,
consider a SCSI controller that has been configured to cache
data and report to the operating system that the data has been
written to stable storage, when, in fact, it has only been
written into the controller RAM cache. If power is lost before
the controller is able to flush its cache to disk, and the
controller cache is not stable (that is, the writes will not
be flushed to disk when power returns), the writes will be
lost. If the writes include database blocks, there is no loss
because recovery will correctly update the database. If the
writes include log file blocks, it is possible that
transactions that were already committed may not appear in the
recovered database, although the recovered database will be
coherent after a crash.
</p>
<p>
If the underlying hardware can fail in any way so that only
part of the block was written, the failure conditions are the
same as those described previously for an operating system
failure that writes only part of a logical database block. In
such cases, configuring the database for checksums will ensure
the corruption is detected.
</p>
<p>
For these reasons, it may be important to select hardware
that does not do partial writes and does not cache data writes
(or does not report that the data has been written to stable
storage until it has either been written to stable storage or
the actual writing of all of the data is guaranteed, barring
catastrophic hardware failure such as your disk drive
exploding).
</p>
<p>
If the disk drive on which you are storing your databases
explodes, you can perform normal Berkeley DB catastrophic
recovery, because it requires only a snapshot of your
databases plus the log files you have archived since that
snapshot was taken. In this case, you should lose no
database changes at all.
</p>
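<p>
          Catastrophic recovery is requested by opening the environment
          with the DB_RECOVER_FATAL flag after restoring the snapshot
          and the archived log files. A minimal sketch, with a
          hypothetical environment home:
        </p>
<pre class="programlisting">#include &lt;db.h>

/*
 * A minimal sketch of catastrophic recovery: run after restoring
 * the database snapshot and archived log files into the
 * (hypothetical) environment home "/var/dbenv".
 */
int
run_catastrophic_recovery(void)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&amp;dbenv, 0)) != 0)
        return (ret);

    /*
     * DB_RECOVER_FATAL replays all available log files, not just
     * those written since the last checkpoint.
     */
    ret = dbenv->open(dbenv, "/var/dbenv",
        DB_CREATE | DB_INIT_LOCK | DB_INIT_LOG |
        DB_INIT_MPOOL | DB_INIT_TXN | DB_RECOVER_FATAL, 0);

    (void)dbenv->close(dbenv, 0);
    return (ret);
}</pre>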
<p>
If the disk drive on which you are storing your log files
explodes, you can also perform catastrophic recovery, but you
will lose any database changes made as part of transactions
committed since your last archival of the log files.
Alternatively, if your database environment and databases are
still available after you lose the log file disk, you should
be able to dump your databases. However, you may see an
inconsistent snapshot of your data after doing the dump,
because changes that were part of transactions that were not
yet committed may appear in the database dump. Depending on
the value of the data, a reasonable alternative may be to
perform both the database dump and the catastrophic recovery
and then compare the databases created by the two methods.
</p>
<p>
For these reasons, storing your databases and log files
on different disks should be considered both a safety measure
and a performance enhancement.
</p>
<p>
Finally, you should be aware that Berkeley DB does not
protect against all cases of stable storage hardware failure,
nor does it protect against simple hardware misbehavior (for
example, a disk controller writing incorrect data to the
disk). However, configuring the database for checksums will
ensure that any such corruption is detected.
</p>
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="transapp.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Recovery and filesystem operations </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Transaction tuning</td>
</tr>
</table>
</div>
</body>
</html>