Sindbad~EG File Manager
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Transactional guarantees</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
<link rel="prev" href="rep_bulk.html" title="Bulk transfer" />
<link rel="next" href="rep_lease.html" title="Master leases" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 18.1.40</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Transactional guarantees</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
<th width="60%" align="center">Chapter 12. Berkeley DB Replication </th>
<td width="20%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="rep_trans"></a>Transactional guarantees</h2>
</div>
</div>
</div>
<p>
It is important to consider replication in the context of
the overall database environment's transactional guarantees.
To briefly review, transactional guarantees in a
non-replicated application are based on the writing of log
file records to "stable storage", usually a disk drive. If the
application or system then fails, the Berkeley DB logging
information is reviewed during recovery, and the databases are
updated so that all changes made as part of committed
transactions appear, and all changes made as part of
uncommitted transactions do not appear. In this case, no
information will have been lost.
</p>
<p>
If a database environment does not require the log be
flushed to stable storage on transaction commit (using the
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag to increase performance at the cost of
sacrificing transactional durability), Berkeley DB recovery
will only be able to restore the system to the state of the
last commit found on stable storage. In this case, information
may have been lost (for example, the changes made by some
committed transactions may not appear in the databases after
recovery).
</p>
<p>
Further, if there is database or log file loss or corruption
(for example, if a disk drive fails), then catastrophic
recovery is necessary, and Berkeley DB recovery will only be
able to restore the system to the state of the last archived
log file. In this case, information may also have been
lost.
</p>
<p>
Replicating the database environment extends this model, by
adding a new component to "stable storage": the client's
replicated information. If a database environment is
replicated, there is no lost information in the case of
database or log file loss, because the replicated system can
be configured to contain a complete set of databases and log
records up to the point of failure. A database environment
that loses a disk drive can have the drive replaced, and it
can then rejoin the replication group.
</p>
<p>
Because of this new component of stable storage, specifying
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> in a replicated environment no longer
sacrifices durability, as long as one or more clients have
acknowledged receipt of the messages sent by the master. Since
network connections are often faster than local synchronous
disk writes, replication becomes a way for applications to
significantly improve their performance as well as their
reliability.
</p>
<p>
Applications using Replication Manager are free to use
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> at the master and/or clients as they see fit.
The behavior of the <span class="bold"><strong>send</strong></span>
function that Replication Manager provides on the
application's behalf is determined by an "acknowledgement
policy", which is configured by the <a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV->repmgr_set_ack_policy()</a> method.
Clients always send acknowledgements for <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>
messages (unless the acknowledgement policy in effect
indicates that the master doesn't care about them). For a
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message, the master blocks the sending
thread until either it receives the proper number of
acknowledgements, or the <a href="../api_reference/C/repset_timeout.html#set_timeout_DB_REP_ACK_TIMEOUT" class="olink">DB_REP_ACK_TIMEOUT</a> expires. In the
case of timeout, Replication Manager returns an error code
from the <span class="bold"><strong>send</strong></span> function,
causing Berkeley DB to flush the transaction log before
returning to the application, as previously described. The
default acknowledgement policy is <a href="../api_reference/C/repmgrset_ack_policy.html#ackspolicy_DB_REPMGR_ACKS_QUORUM" class="olink">DB_REPMGR_ACKS_QUORUM</a>,
which ensures that the effect of a permanent record remains
durable following an election.
</p>
<p>
The rest of this section discusses transactional guarantee
considerations in Base API applications.
</p>
<p>
The return status from the application's <span class="bold"><strong>send</strong></span> function must be set by the
application to ensure the transactional guarantees the
application wants to provide. Whenever the <span class="bold"><strong>send</strong></span> function returns failure, the
local database environment's log is flushed as necessary to
ensure that any information critical to database integrity is
not lost. Because this flush is an expensive operation in
terms of database performance, applications should avoid
returning an error from the <span class="bold"><strong>send</strong></span> function,
if at all possible.
</p>
<p>
The only interesting message type for replication
transactional guarantees is when the application's <span class="bold"><strong>send</strong></span> function was called with the
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag specified. There is no reason for the
<span class="bold"><strong>send</strong></span> function to ever
return failure unless the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag was
specified -- messages without the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag do
not make visible changes to databases, and the <span class="bold"><strong>send</strong></span> function can return success to
Berkeley DB as soon as the message has been sent to the
client(s) or even just copied to local application memory in
preparation for being sent.
</p>
<p>
When a client receives a <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message, the
client will flush its log to stable storage before returning
(unless the client environment has been configured with the
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> option). If the client is unable to flush a
complete transactional record to disk for any reason (for
example, there is a missing log record before the flagged
message), the call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method on the client
will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> and return the LSN of this record
to the application in the <span class="bold"><strong>ret_lsnp</strong></span> parameter.
The application's client or master message handling loops should take proper
action to ensure the correct transactional guarantees in this case. When
missing records arrive and allow subsequent processing of
previously stored permanent records, the call to the
<a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method on the client will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a>
and return the largest LSN of the permanent records that were
flushed to disk. Client applications can use these LSNs to
know definitively if any particular LSN is permanently stored
or not.
</p>
<p>
An application relying on a client's ability to become a
master and guarantee that no data has been lost will need to
write the <span class="bold"><strong>send</strong></span> function to
return an error whenever it cannot guarantee the site that
will win the next election has the record. Applications not
requiring this level of transactional guarantees need not have
the <span class="bold"><strong>send</strong></span> function return
failure (unless the master's database environment has been
configured with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>), as any information critical
to database integrity has already been flushed to the local
log before <span class="bold"><strong>send</strong></span> was
called.
</p>
<p>
To sum up, the only reason for the <span class="bold"><strong>send</strong></span> function
to return failure is when the
master database environment has been configured to not
synchronously flush the log on transaction commit (that is,
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> was configured on the master), the
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag is specified for the message, and the
<span class="bold"><strong>send</strong></span> function was unable
to determine that some number of clients have received the
current message (and all messages preceding the current
message). How many clients need to receive the message before
the <span class="bold"><strong>send</strong></span> function can return
success is an application choice (and may not depend as much
on a specific number of clients reporting success as one or
more geographically distributed clients).
</p>
<p>
If, however, the application does require on-disk durability
on the master, the master should be configured to
synchronously flush the log on commit. If clients are not
configured to synchronously flush the log, that is, if a
client is running with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> configured, then it is
up to the application to reconfigure that client appropriately
when it becomes a master. That is, the application must
explicitly call <a href="../api_reference/C/envset_flags.html" class="olink">DB_ENV->set_flags()</a> to disable asynchronous log
flushing as part of re-configuring the client as the new
master.
</p>
<p>
Of course, it is important to ensure that the replicated
master and client environments are truly independent of each
other. For example, it does not help matters that a client has
acknowledged receipt of a message if both master and clients
are on the same power supply, as the failure of the power
supply will still potentially lose information.
</p>
<p>
Configuring a Base API application to achieve the proper mix
of performance and transactional guarantees can be complex. In
brief, there are a few controls an application can set to
configure the guarantees it makes: specification of
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the master environment, specification of
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the client environment, the priorities of
different sites participating in an election, and the behavior
of the application's <span class="bold"><strong>send</strong></span>
function.
</p>
<p>
First, it is rarely useful to write and synchronously flush
the log when a transaction commits on a replication client. It
may be useful where systems share resources and multiple
systems commonly fail at the same time. By default, all
Berkeley DB database environments, whether master or client,
synchronously flush the log on transaction commit or prepare.
Generally, replication masters and clients turn log flush off
for transaction commit using the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag.
</p>
<p>
Consider two systems connected by a network interface. One
acts as the master, the other as a read-only client. The
client takes over as master if the master crashes and the
master rejoins the replication group after such a failure.
Both master and client are configured to not synchronously
flush the log on transaction commit (that is, <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>
was configured on both systems). The application's <span class="bold"><strong>send</strong></span> function never returns failure
to the Berkeley DB library, simply forwarding messages to the
client (perhaps over a broadcast mechanism), and always
returning success. On the client, any <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> returns
from the client's <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method are ignored, as well.
This system configuration has excellent performance, but may
lose data in some failure modes.
</p>
<p>
If both the master and the client crash at once, it is
possible to lose committed transactions, that is,
transactional durability is not being maintained. Reliability
can be increased by providing separate power supplies for the
systems and placing them in separate physical
locations.
</p>
<p>
If the connection between the two machines fails (or just
some number of messages are lost), and subsequently the master
crashes, it is possible to lose committed transactions. Again,
transactional durability is not being maintained. Reliability
can be improved in a couple of ways:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Use a reliable network protocol (for example,
TCP/IP instead of UDP).
</p>
</li>
<li>
<p>
Increase the number of clients and network paths to
make it less likely that a message will be lost. In
this case, it is important to also make sure a client
that did receive the message wins any subsequent
election. If a client that did not receive the message
wins a subsequent election, data can still be lost.
</p>
</li>
</ol>
</div>
<p>
Further, systems may want to guarantee message delivery to
the client(s) (for example, to prevent a network connection
from simply discarding messages). Some systems may want to
ensure clients never return out-of-date information, that is,
once a transaction commit returns success on the master, no
client will return old information to a read-only query. Some
of the following changes to a Base API application may be used
to address these issues:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Write the application's <span class="bold"><strong>send</strong></span> function
to not return to Berkeley DB until one or more clients have
acknowledged receipt of the message. The number of
clients chosen will be dependent on the application:
you will want to consider likely network partitions
(ensure that a client at each physical site receives
the message) and geographical diversity (ensure that a
client on each coast receives the message).
</p>
</li>
<li>
<p>
Write the client's message processing loop to not
acknowledge receipt of the message until a call to the
<a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method has returned success. Messages
resulting in a return of <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> from the
<a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method mean the message could not be
flushed to the client's disk. If the client does not
acknowledge receipt of such messages to the master
until a subsequent call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method
returns <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a> and the LSN returned is at
least as large as this message's LSN, then the
master's <span class="bold"><strong>send</strong></span>
function will not return success to the Berkeley DB
library. This means the thread committing the
transaction on the master will not be allowed to
proceed based on the transaction having committed
until the selected set of clients have received the
message and consider it complete.
</p>
<p>
Alternatively, the client's message processing loop
could acknowledge the message to the master, but with
an error code indicating that the application's
<span class="bold"><strong>send</strong></span> function
should not return to the Berkeley DB library until a
subsequent acknowledgement from the same client
indicates success.
</p>
<p>
The application send callback function invoked by
Berkeley DB contains an LSN of the record being sent
(if appropriate for that record). When <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a>
method returns indicators that a permanent record has
been written then it also returns the maximum LSN of
the permanent record written.
</p>
</li>
</ol>
</div>
<p>
There is one final pair of failure scenarios to consider.
First, it is not possible to abort transactions after the
application's <span class="bold"><strong>send</strong></span> function
has been called, as the master may have already written the
commit log records to disk, and so abort is no longer an
option. Second, a related problem is that even though the
master will attempt to flush the local log if the <span class="bold"><strong>send</strong></span> function returns failure, that
flush may fail (for example, when the local disk is full).
Again, the transaction cannot be aborted as one or more
clients may have committed the transaction even if <span class="bold"><strong>send</strong></span> returns failure. Rare
applications may not be able to tolerate these unlikely
failure modes. In that case the application may want
to:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Configure the master to do always local synchronous
commits (turning off the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>
configuration). This will decrease performance
significantly, of course (one of the reasons to use
replication is to avoid local disk writes.) In this
configuration, failure to write the local log will
cause the transaction to abort in all cases.
</p>
</li>
<li>
<p>
Do not return from the application's <span class="bold"><strong>send</strong></span> function under any
conditions, until the selected set of clients has
acknowledged the message. Until the <span class="bold"><strong>send</strong></span> function returns to
the Berkeley DB library, the thread committing the
transaction on the master will wait, and so no
application will be able to act on the knowledge that
the transaction has committed.
</p>
</li>
</ol>
</div>
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="rep.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Bulk transfer </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Master leases</td>
</tr>
</table>
</div>
</body>
</html>
Sindbad File Manager Version 1.0, Coded By Sindbad EG ~ The Terrorists