postgres

mirror of https://github.com/zebrajr/postgres.git synced 2025-12-07 00:20:24 +01:00

Author	SHA1	Message	Date
Kevin Grittner	bf9a60ee33	Fix interaction between CREATE INDEX and "snapshot too old". Since indexes are created without valid LSNs, an index created while a snapshot older than old_snapshot_threshold existed could cause queries to return incorrect results when those old snapshots were used, if any relevant rows had been subject to early pruning before the index was built. Prevent usage of a newly created index until all such snapshots are released, for relations where this can happen. Questions about the interaction of "snapshot too old" with index creation were initially raised by Andres Freund. Reviewed by Robert Haas.	2016-06-10 09:25:31 -05:00
Robert Haas	4bc424b968	pgindent run for 9.6	2016-06-09 18:02:36 -04:00
Greg Stark	e1623c3959	Fix various common mispellings. Mostly these are just comments but there are a few in documentation and a handful in code and tests. Hopefully this doesn't cause too much unnecessary pain for backpatching. I relented from some of the most common like "thru" for that reason. The rest don't seem numerous enough to cause problems. Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings	2016-06-03 16:08:45 +01:00
Alvaro Herrera	0c7cd45b6d	Fix range check for effective_io_concurrency Commit `1aba62ec` moved the range check of that option form guc.c into bufmgr.c, but introduced a bug by changing a >= 0.0 to > 0.0, which made the value 0 no longer accepted. Put it back. Reported by Jeff Janes, diagnosed by Tom Lane	2016-05-24 14:55:34 -04:00
Kevin Grittner	11e178d0dc	Inline initial comparisons in TestForOldSnapshot() Even with old_snapshot_threshold = -1 (which disables the "snapshot too old" feature), performance regressions were seen at moderate to high concurrency. For example, a one-socket, four-core system running 200 connections at saturation could see up to a 2.3% regression, with larger regressions possible on NUMA machines. By inlining the early (smaller, faster) tests in the TestForOldSnapshot() function, the i7 case dropped to a 0.2% regression, which could easily just be noise, and is clearly an improvement. Further testing will show whether more is needed.	2016-04-21 08:40:08 -05:00
Kevin Grittner	a343e223a5	Revert no-op changes to BufferGetPage() The reverted changes were intended to force a choice of whether any newly-added BufferGetPage() calls needed to be accompanied by a test of the snapshot age, to support the "snapshot too old" feature. Such an accompanying test is needed in about 7% of the cases, where the page is being used as part of a scan rather than positioning for other purposes (such as DML or vacuuming). The additional effort required for back-patching, and the doubt whether the intended benefit would really be there, have indicated it is best just to rely on developers to do the right thing based on comments and existing usage, as we do with many other conventions. This change should have little or no effect on generated executable code. Motivated by the back-patching pain of Tom Lane and Robert Haas	2016-04-20 08:31:19 -05:00
Tom Lane	a0382e2d7e	Make partition-lock-release coding more transparent in BufferAlloc(). Coverity complained that oldPartitionLock was possibly dereferenced after having been set to NULL. That actually can't happen, because we'd only use it if (oldFlags & BM_TAG_VALID) is true. But nonetheless Coverity is justified in complaining, because at line 1275 we actually overwrite oldFlags, and then still expect its BM_TAG_VALID bit to be a safe guide to whether to release the oldPartitionLock. Thus, the code would be incorrect if someone else had changed the buffer's BM_TAG_VALID flag meanwhile. That should not happen, since we hold pin on the buffer throughout this sequence, but it's starting to look like a rather shaky chain of logic. And there's no need for such assumptions, because we can simply replace the (oldFlags & BM_TAG_VALID) tests with (oldPartitionLock != NULL), which has identical results and makes it plain to all comers that we don't dereference a null pointer. A small side benefit is that the range of liveness of oldFlags is greatly reduced, possibly allowing the compiler to save a register. This is just cleanup, not an actual bug fix, so there seems no need for a back-patch.	2016-04-18 18:05:56 -04:00
Tom Lane	6b85d4ba9b	Fix portability problem induced by commit `a6f6b7819`. pg_xlogdump includes bufmgr.h. With a compiler that emits code for static inline functions even when they're unreferenced, that leads to unresolved external references in the new static-inline version of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done for similar issues elsewhere. Per buildfarm member pademelon.	2016-04-15 10:44:28 -04:00
Andres Freund	4b74c6a40e	Make init_spin_delay() C89 compliant #2 . My previous attempt at doing so, in `80abbeba23`, was not sufficient. While that fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant expressions in the struct initializer, because the file/line/function information comes from the caller of s_lock(). Give up on using a macro, and use a static inline instead. Discussion: 4369.1460435533@sss.pgh.pa.us	2016-04-14 19:26:13 -07:00
Andres Freund	80abbeba23	Make init_spin_delay() C89 compliant and change stuck spinlock reporting. The current definition of init_spin_delay (introduced recently in `48354581a`) wasn't C89 compliant. It's not legal to refer to refer to non-constant expressions, and the ptr argument was one. This, as reported by Tom, lead to a failure on buildfarm animal pademelon. The pointer, especially on system systems with ASLR, isn't super helpful anyway, though. So instead of making init_spin_delay into an inline function, make s_lock_stuck() report the function name in addition to file:line and change init_spin_delay() accordingly. While not a direct replacement, the function name is likely more useful anyway (line numbers are often hard to interpret in third party reports). This also fixes what file/line number is reported for waits via s_lock(). As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h. Reported-By: Tom Lane Discussion: 4369.1460435533@sss.pgh.pa.us	2016-04-13 17:00:53 -07:00
Kevin Grittner	a6f6b78196	Use static inline function for BufferGetPage() I was initially concerned that the some of the hundreds of references to BufferGetPage() where the literal BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a macro, leading to some hard-to-find performance regressions in corner cases. Inspection of disassembled code has shown identical code at all inspected locations, and the size difference doesn't amount to even one byte per such call. So make it readable. Per gripes from Álvaro Herrera and Tom Lane	2016-04-11 16:47:50 -05:00
Andres Freund	48354581a4	Allow Pin/UnpinBuffer to operate in a lockfree manner. Pinning/Unpinning a buffer is a very frequent operation; especially in read-mostly cache resident workloads. Benchmarking shows that in various scenarios the spinlock protecting a buffer header's state becomes a significant bottleneck. The problem can be reproduced with pgbench -S on larger machines, but can be considerably worse for queries which touch the same buffers over and over at a high frequency (e.g. nested loops over a small inner table). To allow atomic operations to be used, cram BufferDesc's flags, usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable; that allows to manipulate them together using 32bit compare-and-swap operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could be lifted by using a 64bit field, but it's not a realistic configuration atm). As not all operations can easily implemented in a lockfree manner, implement the previous buf_hdr_lock via a flag bit in the atomic variable. That way we can continue to lock the header in places where it's needed, but can get away without acquiring it in the more frequent hot-paths. There's some additional operations which can be done without the lock, but aren't in this patch; but the most important places are covered. As bufmgr.c now essentially re-implements spinlocks, abstract the delay logic from s_lock.c into something more generic. It now has already two users, and more are coming up; there's a follupw patch for lwlock.c at least. This patch is based on a proof-of-concept written by me, which Alexander Korotkov made into a fully working patch; the committed version is again revised by me. Benchmarking and testing has, amongst others, been provided by Dilip Kumar, Alexander Korotkov, Robert Haas. On a large x86 system improvements for readonly pgbench, with a high client count, of a factor of 8 have been observed. Author: Alexander Korotkov and Andres Freund Discussion: 2400449.GjM57CE0Yg@dinodell	2016-04-10 20:12:32 -07:00
Kevin Grittner	848ef42bb8	Add the "snapshot too old" feature This feature is controlled by a new old_snapshot_threshold GUC. A value of -1 disables the feature, and that is the default. The value of 0 is just intended for testing. Above that it is the number of minutes a snapshot can reach before pruning and vacuum are allowed to remove dead tuples which the snapshot would otherwise protect. The xmin associated with a transaction ID does still protect dead tuples. A connection which is using an "old" snapshot does not get an error unless it accesses a page modified recently enough that it might not be able to produce accurate results. This is similar to the Oracle feature, and we use the same SQLSTATE and error message for compatibility.	2016-04-08 14:36:30 -05:00
Kevin Grittner	8b65cf4c5e	Modify BufferGetPage() to prepare for "snapshot too old" feature This patch is a no-op patch which is intended to reduce the chances of failures of omission once the functional part of the "snapshot too old" patch goes in. It adds parameters for snapshot, relation, and an enum to specify whether the snapshot age check needs to be done for the page at this point. This initial patch passes NULL for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the third. The follow-on patch will change the places where the test needs to be made.	2016-04-08 14:30:10 -05:00
Noah Misch	4ad6f13500	Copyedit comments and documentation.	2016-04-01 21:53:10 -04:00
Robert Haas	3aff33aa68	Fix typos. Oskari Saarenmaa	2016-03-15 18:06:11 -04:00
Andres Freund	c94f0c29ce	Blindly try to fix dtrace enabled builds, broken in `9cd00c45`. Reported-By: Peter Eisentraut Discussion: 56E2239E.1050607@gmx.net	2016-03-10 17:51:03 -08:00
Andres Freund	9cd00c457e	Checkpoint sorting and balancing. Up to now checkpoints were written in the order they're in the BufferDescriptors. That's nearly random in a lot of cases, which performs badly on rotating media, but even on SSDs it causes slowdowns. To avoid that, sort checkpoints before writing them out. We currently sort by tablespace, relfilenode, fork and block number. One of the major reasons that previously wasn't done, was fear of imbalance between tablespaces. To address that balance writes between tablespaces. The other prime concern was that the relatively large allocation to sort the buffers in might fail, preventing checkpoints from happening. Thus pre-allocate the required memory in shared memory, at server startup. This particularly makes it more efficient to have checkpoint flushing enabled, because that'll often result in a lot of writes that can be coalesced into one flush. Discussion: alpine.DEB.2.10.1506011320000.28433@sto Author: Fabien Coelho and Andres Freund	2016-03-10 17:05:09 -08:00
Andres Freund	428b1d6b29	Allow to trigger kernel writeback after a configurable number of writes. Currently writes to the main data files of postgres all go through the OS page cache. This means that some operating systems can end up collecting a large number of dirty buffers in their respective page caches. When these dirty buffers are flushed to storage rapidly, be it because of fsync(), timeouts, or dirty ratios, latency for other reads and writes can increase massively. This is the primary reason for regular massive stalls observed in real world scenarios and artificial benchmarks; on rotating disks stalls on the order of hundreds of seconds have been observed. On linux it is possible to control this by reducing the global dirty limits significantly, reducing the above problem. But global configuration is rather problematic because it'll affect other applications; also PostgreSQL itself doesn't always generally want this behavior, e.g. for temporary files it's undesirable. Several operating systems allow some control over the kernel page cache. Linux has sync_file_range(2), several posix systems have msync(2) and posix_fadvise(2). sync_file_range(2) is preferable because it requires no special setup, whereas msync() requires the to-be-flushed range to be mmap'ed. For the purpose of flushing dirty data posix_fadvise(2) is the worst alternative, as flushing dirty data is just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages from the page cache. Thus the feature is enabled by default only on linux, but can be enabled on all systems that have any of the above APIs. While desirable and likely possible this patch does not contain an implementation for windows. With the infrastructure added, writes made via checkpointer, bgwriter and normal user backends can be flushed after a configurable number of writes. Each of these sources of writes controlled by a separate GUC, checkpointer_flush_after, bgwriter_flush_after and backend_flush_after respectively; they're separate because the number of flushes that are good are separate, and because the performance considerations of controlled flushing for each of these are different. A later patch will add checkpoint sorting - after that flushes from the ckeckpoint will almost always be desirable. Bgwriter flushes are most of the time going to be random, which are slow on lots of storage hardware. Flushing in backends works well if the storage and bgwriter can keep up, but if not it can have negative consequences. This patch is likely to have negative performance consequences without checkpoint sorting, but unfortunately so has sorting without flush control. Discussion: alpine.DEB.2.10.1506011320000.28433@sto Author: Fabien Coelho and Andres Freund	2016-03-10 17:04:34 -08:00
Robert Haas	53be0b1add	Provide much better wait information in pg_stat_activity. When a process is waiting for a heavyweight lock, we will now indicate the type of heavyweight lock for which it is waiting. Also, you can now see when a process is waiting for a lightweight lock - in which case we will indicate the individual lock name or the tranche, as appropriate - or for a buffer pin. Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful discussion and suggestions by many others, including Alexander Korotkov, Vladimir Borodin, and many others.	2016-03-10 12:44:09 -05:00
Andres Freund	ea56b06cf7	Fix wrong keysize in PrivateRefCountHash creation. In `4b4b680c3` I accidentally used sizeof(PrivateRefCountArray) instead of sizeof(PrivateRefCountEntry) when creating the refcount overflow hashtable. As the former is bigger than the latter, this luckily only resulted in a slightly increased memory usage when many buffers are pinned in a backend. Reported-By: Takashi Horikawa Discussion: 73FA3881462C614096F815F75628AFCD035A48C3@BPXM01GP.gisp.nec.co.jp Backpatch: 9.5, where thew new ref count infrastructure was introduced	2016-02-21 22:48:44 -08:00
Tom Lane	c5e9b77127	Revert "Temporarily make pg_ctl and server shutdown a whole lot chattier." This reverts commit `3971f64843` and a couple of followon debugging commits; I think we've learned what we can from them.	2016-02-10 16:01:04 -05:00
Tom Lane	3971f64843	Temporarily make pg_ctl and server shutdown a whole lot chattier. This is a quick hack, due to be reverted when its purpose has been served, to try to gather information about why some of the buildfarm critters regularly fail with "postmaster does not shut down" complaints. Maybe they are just really overloaded, but maybe something else is going on. Hence, instrument pg_ctl to print the current time when it starts waiting for postmaster shutdown and when it gives up, and add a lot of logging of the current time in the server's checkpoint and shutdown code paths. No attempt has been made to make this pretty. I'm not even totally sure if it will build on Windows, but we'll soon find out.	2016-02-08 18:43:11 -05:00
Bruce Momjian	ee94300446	Update copyright for 2016 Backpatch certain files through 9.1	2016-01-02 13:33:40 -05:00
Robert Haas	6150a1b08a	Move buffer I/O and content LWLocks out of the main tranche. Move the content lock directly into the BufferDesc, so that locking and pinning a buffer touches only one cache line rather than two. Adjust the definition of BufferDesc slightly so that this doesn't make the BufferDesc any larger than one cache line (at least on platforms where a spinlock is only 1 or 2 bytes). We can't fit the I/O locks into the BufferDesc and stay within one cache line, so move those to a completely separate tranche. This leaves a relatively limited number of LWLocks in the main tranche, so increase the padding of those remaining locks to a full cache line, rather than allowing adjacent locks to share a cache line, hopefully reducing false sharing. Performance testing shows that these changes make little difference on laptop-class machines, but help significantly on larger servers, especially those with more than 2 sockets. Andres Freund, originally based on an earlier patch by Simon Riggs. Review and cosmetic adjustments (including heavy rewriting of the comments) by me.	2015-12-15 13:32:54 -05:00
Andres Freund	2a3544960e	Correct statement to actually be the intended assert statement. `e3f4cfc7` introduced a LWLockHeldByMe() call, without the corresponding Assert() surrounding it. Spotted by Coverity. Backpatch: 9.1+, like the previous commit	2015-12-14 11:23:24 +01:00
Andres Freund	e3f4cfc7aa	Fix bug leading to restoring unlogged relations from empty files. At the end of crash recovery, unlogged relations are reset to the empty state, using their init fork as the template. The init fork is copied to the main fork without going through shared buffers. Unfortunately WAL replay so far has not necessarily flushed writes from shared buffers to disk at that point. In normal crash recovery, and before the introduction of 'fast promotions' in `fd4ced523` / 9.3, the END_OF_RECOVERY checkpoint flushes the buffers out in time. But with fast promotions that's not the case anymore. To fix, force WAL writes targeting the init fork to be flushed immediately (using the new FlushOneBuffer() function). In 9.5+ that flush can centrally be triggered from the code dealing with restoring full page writes (XLogReadBufferForRedoExtended), in earlier releases that responsibility is in the hands of XLOG_HEAP_NEWPAGE's replay function. Backpatch to 9.1, even if this currently is only known to trigger in 9.3+. Flushing earlier is more robust, and it is advantageous to keep the branches similar. Typical symptoms of this bug are errors like 'ERROR: index "..." contains unexpected zero page at block 0' shortly after promoting a node. Reported-By: Thom Brown Author: Andres Freund and Michael Paquier Discussion: 20150326175024.GJ451@alap3.anarazel.de Backpatch: 9.1-	2015-12-10 16:29:26 +01:00
Robert Haas	e93b62985f	Remove volatile qualifiers from bufmgr.c and freelist.c Prior to commit `0709b7ee72`, access to variables within a spinlock-protected critical section had to be done through a volatile pointer, but that should no longer be necessary. Review by Andres Freund	2015-11-16 18:50:06 -05:00
Alvaro Herrera	1aba62ec63	Allow per-tablespace effective_io_concurrency Per discussion, nowadays it is possible to have tablespaces that have wildly different I/O characteristics from others. Setting different effective_io_concurrency parameters for those has been measured to improve performance. Author: Julien Rouhaud Reviewed by: Andres Freund	2015-09-08 12:51:42 -03:00
Andres Freund	d25fbf9f3e	Fix two off-by-one errors in bufmgr.c. In `4b4b680c` I passed a buffer index number (starting from 0) instead of a proper Buffer id (which start from 1 for shared buffers) in two places. This wasn't noticed so far as one of those locations isn't compiled at all (PrintPinnedBufs) and the other one (InvalidBuffer) requires a unlikely, but possible, set of circumstances to trigger a symptom. To reduce the likelihood of such incidents a bit also convert existing open coded mappings from buffer descriptors to buffer ids with BufferDescriptorGetBuffer(). Author: Qingqing Zhou Reported-By: Qingqing Zhou Discussion: CAJjS0u2ai9ooUisKtkV8cuVUtEkMTsbK8c7juNAjv8K11zeCQg@mail.gmail.com Backpatch: 9.5 where the private ref count infrastructure was introduced	2015-08-12 17:35:50 +02:00
Heikki Linnakangas	4b8e24b9ad	Fix a couple of bugs with wal_log_hints. 1. Replay of the WAL record for setting a bit in the visibility map contained an assertion that a full-page image of that record type can only occur with checksums enabled. But it can also happen with wal_log_hints, so remove the assertion. Unlike checksums, wal_log_hints can be changed on the fly, so it would be complicated to figure out if it was enabled at the time that the WAL record was generated. 2. wal_log_hints has the same effect on the locking needed to read the LSN of a page as data checksums. BufferGetLSNAtomic() didn't get the memo. Backpatch to 9.4, where wal_log_hints was added.	2015-06-26 12:38:24 +03:00
Bruce Momjian	807b9e0dff	pgindent run for 9.5	2015-05-23 21:35:49 -04:00
Heikki Linnakangas	fa60fb63e5	Fix more typos in comments. Patch by CharSyam, plus a few more I spotted with grep.	2015-05-20 19:45:43 +03:00
Heikki Linnakangas	4fc72cc7bb	Collection of typo fixes. Use "a" and "an" correctly, mostly in comments. Two error messages were also fixed (they were just elogs, so no translation work required). Two function comments in pg_proc.h were also fixed. Etsuro Fujita reported one of these, but I found a lot more with grep. Also fix a few other typos spotted while grepping for the a/an typos. For example, "consists out of ..." -> "consists of ...". Plus a "though"/ "through" mixup reported by Euler Taveira. Many of these typos were in old code, which would be nice to backpatch to make future backpatching easier. But much of the code was new, and I didn't feel like crafting separate patches for each branch. So no backpatching.	2015-05-20 16:56:22 +03:00
Andres Freund	bc208a5a2f	Guard against spurious signals in LockBufferForCleanup. When LockBufferForCleanup() has to wait for getting a cleanup lock on a buffer it does so by setting a flag in the buffer header and then wait for other backends to signal it using ProcWaitForSignal(). Unfortunately LockBufferForCleanup() missed that ProcWaitForSignal() can return for other reasons than the signal it is hoping for. If such a spurious signal arrives the wait flags on the buffer header will still be set. That then triggers "ERROR: multiple backends attempting to wait for pincount 1". The fix is simple, unset the flag if still set when retrying. That implies an additional spinlock acquisition/release, but that's unlikely to matter given the cost of waiting for a cleanup lock. Alternatively it'd have been possible to move responsibility for maintaining the relevant flag to the waiter all together, but that might have had negative consequences due to possible floods of signals. Besides being more invasive. This looks to be a very longstanding bug. The relevant code in LockBufferForCleanup() hasn't changed materially since its introduction and ProcWaitForSignal() was documented to return for unrelated reasons since 8.2. The master only patch series removing ImmediateInterruptOK made it much easier to hit though, as ProcSendSignal/ProcWaitForSignal now uses a latch shared with other tasks. Per discussion with Kevin Grittner, Tom Lane and me. Backpatch to all supported branches. Discussion: 11553.1423805224@sss.pgh.pa.us	2015-02-23 16:14:14 +01:00
Andres Freund	ed127002d8	Align buffer descriptors to cache line boundaries. Benchmarks has shown that aligning the buffer descriptor array to cache lines is important for scalability; especially on bigger, multi-socket, machines. Currently the array sometimes already happens to be aligned by happenstance, depending how large previous shared memory allocations were. That can lead to wildly varying performance results after minor configuration changes. In addition to aligning the start of descriptor array, also force the size of individual descriptors to be of a common cache line size (64 bytes). That happens to already be the case on 64bit platforms, but this way we can change the struct BufferDesc more easily. As the alignment primarily matters in highly concurrent workloads which probably all are 64bit these days, and the space wastage of element alignment would be a bit more noticeable on 32bit systems, we don't force the stride to be cacheline sized on 32bit platforms for now. If somebody does actual performance testing, we can reevaluate that decision by changing the definition of BUFFERDESC_PADDED_SIZE. Discussion: 20140202151319.GD32123@awork2.anarazel.de Per discussion with Bruce Momjan, Tom Lane, Robert Haas, and Peter Geoghegan.	2015-01-29 22:48:45 +01:00
Andres Freund	7142bfbbd3	Fix #ifdefed'ed out code to compile again.	2015-01-29 22:48:45 +01:00
Heikki Linnakangas	acc2b1e843	Fix typo in comment.	2015-01-28 10:26:30 +02:00
Andres Freund	2d115e47c8	Fix various shortcomings of the new PrivateRefCount infrastructure. As noted by Tom Lane the improvements in `4b4b680c3d` had the problem that in some situations we searched, entered and modified entries in the private refcount hash while holding a spinlock. I had tried to keep the logic entirely local to PinBuffer_Locked(), but that's not really possible given it's called with a spinlock held... Besides being disadvantageous from a performance point of view, this also has problems with error handling safety. If we failed inserting an entry into the hashtable due to an out of memory error, we'd error out with a held spinlock. Not good. Change the way private refcounts are manipulated: Before a buffer can be tracked an entry has to be reserved using ReservePrivateRefCountEntry(); then, if a entry is not found using GetPrivateRefCountEntry(), it can be entered with NewPrivateRefCountEntry(). Also take advantage of the fact that PinBuffer_Locked() currently is never called for buffers that already have been pinned by the current backend and don't search the private refcount entries for preexisting local pins. That results in a small, but measurable, performance improvement. Additionally make ReleaseBuffer() always call UnpinBuffer() for shared buffers. That avoids duplicating work in an eventual UnpinBuffer() call that already has been done in ReleaseBuffer() and also saves some code. Per discussion with Tom Lane. Discussion: 15028.1418772313@sss.pgh.pa.us	2015-01-19 23:59:41 +01:00
Bruce Momjian	4baaf863ec	Update copyright for 2015 Backpatch certain files through 9.0	2015-01-06 11:43:47 -05:00
Tom Lane	4a14f13a0a	Improve hash_create's API for selecting simple-binary-key hash functions. Previously, if you wanted anything besides C-string hash keys, you had to specify a custom hashing function to hash_create(). Nearly all such callers were specifying tag_hash or oid_hash; which is tedious, and rather error-prone, since a caller could easily miss the opportunity to optimize by using hash_uint32 when appropriate. Replace this with a design whereby callers using simple binary-data keys just specify HASH_BLOBS and don't need to mess with specific support functions. hash_create() itself will take care of optimizing when the key size is four bytes. This nets out saving a few hundred bytes of code space, and offers a measurable performance improvement in tidbitmap.c (which was not exploiting the opportunity to use hash_uint32 for its 4-byte keys). There might be some wins elsewhere too, I didn't analyze closely. In future we could look into offering a similar optimized hashing function for 8-byte keys. Under this design that could be done in a centralized and machine-independent fashion, whereas getting it right for keys of platform-dependent sizes would've been notationally painful before. For the moment, the old way still works fine, so as not to break source code compatibility for loadable modules. Eventually we might want to remove tag_hash and friends from the exported API altogether, since there's no real need for them to be explicitly referenced from outside dynahash.c. Teodor Sigaev and Tom Lane	2014-12-18 13:36:36 -05:00
Heikki Linnakangas	81c4508196	Fix race condition between hot standby and restoring a full-page image. There was a window in RestoreBackupBlock where a page would be zeroed out, but not yet locked. If a backend pinned and locked the page in that window, it saw the zeroed page instead of the old page or new page contents, which could lead to missing rows in a result set, or errors. To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins, zeroes, and locks the page, if it's not in the buffer cache already. In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE, to avoid breaking any 3rd party extensions that might use RBM_ZERO. More importantly, this avoids renumbering the other enum values, which would cause even bigger confusion in extensions that use ReadBufferExtended, but haven't been recompiled. Backpatch to all supported versions; this has been racy since hot standby was introduced.	2014-11-13 20:02:37 +02:00
Heikki Linnakangas	2076db2aea	Move the backup-block logic from XLogInsert to a new file, xloginsert.c. xlog.c is huge, this makes it a little bit smaller, which is nice. Functions related to putting together the WAL record are in xloginsert.c, and the lower level stuff for managing WAL buffers and such are in xlog.c. Also move the definition of XLogRecord to a separate header file. This causes churn in the #includes of all the files that write WAL records, and redo routines, but it avoids pulling in xlog.h into most places. Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.	2014-11-06 13:55:36 +02:00
Andres Freund	7dbb606938	Flush unlogged table's buffers when copying or moving databases. CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source database directory on the filesystem level. To ensure the on disk state is consistent they block out users of the affected database and force a checkpoint to flush out all data to disk. Unfortunately, up to now, that checkpoint didn't flush out dirty buffers from unlogged relations. That bug means there could be leftover dirty buffers in either the template database, or the database in its old location. Leading to problems when accessing relations in an inconsistent state; and to possible problems during shutdown in the SET TABLESPACE case because buffers belonging files that don't exist anymore are flushed. This was reported in bug #10675 by Maxim Boguk. Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and Fujii Masao. Backpatch to 9.1 where unlogged tables were introduced.	2014-10-20 23:43:46 +02:00
Robert Haas	5d7962c679	Change locking regimen around buffer replacement. Previously, we used an lwlock that was held from the time we began seeking a candidate buffer until the time when we found and pinned one, which is disastrous for concurrency. Instead, use a spinlock which is held just long enough to pop the freelist or advance the clock sweep hand, and then released. If we need to advance the clock sweep further, we reacquire the spinlock once per buffer. This represents a significant increase in atomic operations around buffer eviction, but it still wins on many workloads. On others, it may result in no gain, or even cause a regression, unless the number of buffer mapping locks is also increased. However, that seems like material for a separate commit. We may also need to consider other methods of mitigating contention on this spinlock, such as splitting it into multiple locks or jumping the clock sweep hand more than one buffer at a time, but those, too, seem like separate improvements. Patch by me, inspired by a much larger patch from Amit Kapila. Reviewed by Andres Freund.	2014-09-25 10:43:24 -04:00
Andres Freund	4b4b680c3d	Make backend local tracking of buffer pins memory efficient. Since the dawn of time (aka Postgres95) multiple pins of the same buffer by one backend have been optimized not to modify the shared refcount more than once. This optimization has always used a NBuffer sized array in each backend keeping track of a backend's pins. That array (PrivateRefCount) was one of the biggest per-backend memory allocations, depending on the shared_buffers setting. Besides the waste of memory it also has proven to be a performance bottleneck when assertions are enabled as we make sure that there's no remaining pins left at the end of transactions. Also, on servers with lots of memory and a correspondingly high shared_buffers setting the amount of random memory accesses can also lead to poor cpu cache efficiency. Because of these reasons a backend's buffers pins are now kept track of in a small statically sized array that overflows into a hash table when necessary. Benchmarks have shown neutral to positive performance results with considerably lower memory usage. Patch by me, review by Robert Haas. Discussion: 20140321182231.GA17111@alap3.anarazel.de	2014-08-30 14:03:21 +02:00
Tom Lane	27cef0a561	Check block number against the correct fork in get_raw_page(). get_raw_page tried to validate the supplied block number against RelationGetNumberOfBlocks(), which of course is only right when accessing the main fork. In most cases, the main fork is longer than the others, so that the check was too weak (allowing a lower-level error to be reported, but no real harm to be done). However, very small tables could have an FSM larger than their heap, in which case the mistake prevented access to some FSM pages. Per report from Torsten Foertsch. In passing, make the bad-block-number error into an ereport not elog (since it's certainly not an internal error); and fix sloppily maintained comment for RelationGetNumberOfBlocksInFork. This has been wrong since we invented relation forks, so back-patch to all supported branches.	2014-07-22 11:46:29 -04:00
Andres Freund	3bdcf6a5a7	Don't allow to disable backend assertions via the debug_assertions GUC. The existance of the assert_enabled variable (backing the debug_assertions GUC) reduced the amount of knowledge some static code checkers (like coverity and various compilers) could infer from the existance of the assertion. That could have been solved by optionally removing the assertion_enabled variable from the Assert() et al macros at compile time when some special macro is defined, but the resulting complication doesn't seem to be worth the gain from having debug_assertions. Recompiling is fast enough. The debug_assertions GUC is still available, but readonly, as it's useful when diagnosing problems. The commandline/client startup option -A, which previously also allowed to enable/disable assertions, has been removed as it doesn't serve a purpose anymore. While at it, reduce code duplication in bufmgr.c and localbuf.c assertions checking for spurious buffer pins. That code had to be reindented anyway to cope with the assert_enabled removal.	2014-06-20 11:09:17 +02:00
Bruce Momjian	0a78320057	pgindent run for 9.4 This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.	2014-05-06 12:12:18 -04:00
Tom Lane	2d00190495	Rationalize common/relpath.[hc]. Commit `a730183926` created rather a mess by putting dependencies on backend-only include files into include/common. We really shouldn't do that. To clean it up: * Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in catalog/catalog.h. We won't consider this symbol part of the FE/BE API. * Push enum ForkNumber from relfilenode.h into relpath.h. We'll consider relpath.h as the source of truth for fork numbers, since relpath.c was already partially serving that function, and anyway relfilenode.h was kind of a random place for that enum. * So, relfilenode.h now includes relpath.h rather than vice-versa. This direction of dependency is fine. (That allows most, but not quite all, of the existing explicit #includes of relpath.h to go away again.) * Push forkname_to_number from catalog.c to relpath.c, just to centralize fork number stuff a bit better. * Push GetDatabasePath from catalog.c to relpath.c; it was rather odd that the previous commit didn't keep this together with relpath(). * To avoid needing relfilenode.h in common/, redefine the underlying function (now called GetRelationPath) as taking separate OID arguments, and make the APIs using RelFileNode or RelFileNodeBackend into macro wrappers. (The macros have a potential multiple-eval risk, but none of the existing call sites have an issue with that; one of them had such a risk already anyway.) * Fix failure to follow the directions when "init" fork type was added; specifically, the errhint in forkname_to_number wasn't updated, and neither was the SGML documentation for pg_relation_size(). * Fix tablespace-path-too-long check in CreateTableSpace() to account for fork-name component of maximum-length pathnames. This requires putting FORKNAMECHARS into a header file, but it was rather useless (and actually unreferenced) where it was. The last couple of items are potentially back-patchable bug fixes, if anyone is sufficiently excited about them; but personally I'm not. Per a gripe from Christoph Berg about how include/common wasn't self-contained.	2014-04-30 17:30:50 -04:00
Robert Haas	066254cea1	Count buffers dirtied due to hints in pgBufferUsage.shared_blks_dirtied. Previously, such buffers weren't counted, with the possible result that EXPLAIN (BUFFERS) and pg_stat_statements would understate the true number of blocks dirtied by an SQL statement. Back-patch to 9.2, where this counter was introduced. Amit Kapila	2014-03-31 13:06:26 -04:00
Robert Haas	ea9df812d8	Relax the requirement that all lwlocks be stored in a single array. This makes it possible to store lwlocks as part of some other data structure in the main shared memory segment, or in a dynamic shared memory segment. There is still a main LWLock array and this patch does not move anything out of it, but it provides necessary infrastructure for doing that in the future. This change is likely to increase the size of LWLockPadded on some platforms, especially 32-bit platforms where it was previously only 16 bytes. Patch by me. Review by Andres Freund and KaiGai Kohei.	2014-01-27 11:07:44 -05:00
Tom Lane	061b079f89	Fix multiple bugs in index page locking during hot-standby WAL replay. In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.	2014-01-14 17:35:21 -05:00
Bruce Momjian	7e04792a1c	Update copyright for 2014 Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.	2014-01-07 16:05:30 -05:00
Heikki Linnakangas	50e547096c	Add GUC to enable WAL-logging of hint bits, even with checksums disabled. WAL records of hint bit updates is useful to tools that want to examine which pages have been modified. In particular, this is required to make the pg_rewind tool safe (without checksums). This can also be used to test how much extra WAL-logging would occur if you enabled checksums, without actually enabling them (which you can't currently do without re-initdb'ing). Sawada Masahiko, docs by Samrat Revagade. Reviewed by Dilip Kumar, with further changes by me.	2013-12-13 16:26:14 +02:00
Robert Haas	05ee328d66	Fix typo in comment. Etsuro Fujita	2013-08-02 09:15:42 -04:00
Jeff Davis	b8fd1a09f3	Add buffer_std flag to MarkBufferDirtyHint(). MarkBufferDirtyHint() writes WAL, and should know if it's got a standard buffer or not. Currently, the only callers where buffer_std is false are related to the FSM. In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive. Back-patch to 9.3.	2013-06-17 08:02:12 -07:00
Tom Lane	f04216341d	Refactor checksumming code to make it easier to use externally. pg_filedump and other external utility programs are likely to want to be able to check Postgres page checksums. To avoid messy duplication of code, move the checksumming functionality into an exported header file, much as we did awhile back for the CRC code. In passing, get rid of an unportable assumption that a static char[] array will be word-aligned, and do some other minor code beautification.	2013-06-13 22:35:56 -04:00
Bruce Momjian	9af4159fce	pgindent run for release 9.3 This is the first run of the Perl-based pgindent script. Also update pgindent instructions.	2013-05-29 16:58:43 -04:00
Simon Riggs	47c4333189	Avoid tricky race condition recording XLOG_HINT We copy the buffer before inserting an XLOG_HINT to avoid WAL CRC errors caused by concurrent hint writes to buffer while share locked. To make this work we refactor RestoreBackupBlock() to allow an XLOG_HINT to avoid the normal path for backup blocks, which assumes the underlying buffer is exclusive locked. Resulting code completely changes layout of XLOG_HINT WAL records, but this isn't even beta code, so this is a low impact change. In passing, avoid taking WALInsertLock for full page writes on checksummed hints, remove related cruft from XLogInsert() and improve xlog_desc record for XLOG_HINT. Andres Freund Bug report by Fujii Masao, testing by Jeff Janes and Jaime Casanova, review by Jeff Davis and Simon Riggs. Applied with changes from review and some comment editing.	2013-04-08 08:52:39 +01:00
Simon Riggs	1be203519a	Tune BufferGetLSNAtomic() when checksums !enabled From performance analysis by Heikki Linnakangas	2013-04-07 22:37:39 +01:00
Simon Riggs	96ef3b8ff1	Allow I/O reliability checks using 16-bit checksums Checksums are set immediately prior to flush out of shared buffers and checked when pages are read in again. Hint bit setting will require full page write when block is dirtied, which causes various infrastructure changes. Extensive comments, docs and README. WARNING message thrown if checksum fails on non-all zeroes page; ERROR thrown but can be disabled with ignore_checksum_failure = on. Feature enabled by an initdb option, since transition from option off to option on is long and complex and has not yet been implemented. Default is not to use checksums. Checksum used is WAL CRC-32 truncated to 16-bits. Simon Riggs, Jeff Davis, Greg Smith Wide input and assistance from many community members. Thank you.	2013-03-22 13:54:07 +00:00
Tom Lane	dcafdbcde1	Improve error reporting in code that checks for buffer refcount leaks. Formerly we just Assert'ed that each refcount was zero, which was quick and easy but failed to provide a good overview of what was wrong. Change the code so that we'll call PrintBufferLeakWarning() for each buffer with a nonzero refcount, and then Assert at the end of the loop. This costs nothing in runtime and might ease diagnosis of some bugs. Greg Smith, reviewed by Satoshi Nagayasu, further tweaked by me	2013-03-15 12:26:26 -04:00
Alvaro Herrera	a730183926	Move relpath() to libpgcommon This enables non-backend code, such as pg_xlogdump, to use it easily. The previous location, in src/backend/catalog/catalog.c, made that essentially impossible because that file depends on many backend-only facilities; so this needs to live separately.	2013-02-21 22:46:17 -03:00
Heikki Linnakangas	62401db45c	Support unlogged GiST index. The reason this wasn't supported before was that GiST indexes need an increasing sequence to detect concurrent page-splits. In a regular WAL- logged GiST index, the LSN of the page-split record is used for that purpose, and in a temporary index, we can get away with a backend-local counter. Neither of those methods works for an unlogged relation. To provide such an increasing sequence of numbers, create a "fake LSN" counter that is saved and restored across shutdowns. On recovery, unlogged relations are blown away, so the counter doesn't need to survive that either. Jeevan Chalke, based on discussions with Robert Haas, Tom Lane and me.	2013-02-11 23:07:09 +02:00
Alvaro Herrera	279628a0a7	Accelerate end-of-transaction dropping of relations When relations are dropped, at end of transaction we need to remove the files and clean the buffer pool of buffers containing pages of those relations. Previously we would scan the buffer pool once per relation to clean up buffers. When there are many relations to drop, the repeated scans make this process slow; so we now instead pass a list of relations to drop and scan the pool once, checking each buffer against the passed list. When the number of relations is larger than a threshold (which as of this patch is being set to 20 relations) we sort the array before starting, and bsearch the array; when it's smaller, we simply scan the array linearly each time, because that's faster. The exact optimal threshold value depends on many factors, but the difference is not likely to be significant enough to justify making it user-settable. This has been measured to be a significant win (a 15x win when dropping 100,000 relations; an extreme case, but reportedly a real one). Author: Tomas Vondra, some tweaks by me Reviewed by: Robert Haas, Shigeru Hanada, Andres Freund, Álvaro Herrera	2013-01-17 16:13:17 -03:00
Bruce Momjian	bd61a623ac	Update copyrights for 2013 Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.	2013-01-01 17:15:01 -05:00
Heikki Linnakangas	dbdf9679d7	Use correct text domain for translating errcontext() messages. errcontext() is typically used in an error context callback function, not within an ereport() invocation like e.g errmsg and errdetail are. That means that the message domain that the TEXTDOMAIN magic in ereport() determines is not the right one for the errcontext() calls. The message domain needs to be determined by the C file containing the errcontext() call, not the file containing the ereport() call. Fix by turning errcontext() into a macro that passes the TEXTDOMAIN to use for the errcontext message. "errcontext" was used in a few places as a variable or struct field name, I had to rename those out of the way, now that errcontext is a macro. We've had this problem all along, but this isn't doesn't seem worth backporting. It's a fairly minor issue, and turning errcontext from a function to a macro requires at least a recompile of any external code that calls errcontext().	2012-11-12 17:07:29 +02:00
Tom Lane	9bacf0e373	Revert "Use "transient" files for blind writes, take 2". This reverts commit `fba105b109`. That approach had problems with the smgr-level state not tracking what we really want to happen, and with the VFD-level state not tracking the smgr-level state very well either. In consequence, it was still possible to hold kernel file descriptors open for long-gone tables (as in recent report from Tore Halset), and yet there were also cases of FDs being closed undesirably soon. A replacement implementation will follow.	2012-10-17 12:37:08 -04:00
Simon Riggs	64e196b6ef	Fix bufmgr so CHECKPOINT_END_OF_RECOVERY behaves as a shutdown checkpoint. Recovery code documents clearly that a shutdown checkpoint is executed at end of recovery - a shutdown checkpoint WAL record is written but the buffer manager had been altered to treat end of recovery as a normal checkpoint. This bug exacerbates the bufmgr relpersistence bug. Bug spotted by Andres Freund, patch by me.	2012-09-16 19:53:34 +01:00
Robert Haas	beb850e1d8	Properly set relpersistence for fake relcache entries. This can result in buffers failing to be properly flushed at checkpoint time, leading to data loss. Report, diagnosis, and patch by Jeff Davis.	2012-09-14 09:35:07 -04:00
Alvaro Herrera	45326c5a11	Split resowner.h This lets files that are mere users of ResourceOwner not automatically include the headers for stuff that is managed by the resowner mechanism.	2012-08-28 18:02:07 -04:00
Tom Lane	73b796a52c	Improve coding around the fsync request queue. In all branches back to 8.3, this patch fixes a questionable assumption in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are no uninitialized pad bytes in the request queue structs. This would only cause trouble if (a) there were such pad bytes, which could happen in 8.4 and up if the compiler makes enum ForkNumber narrower than 32 bits, but otherwise would require not-currently-planned changes in the widths of other typedefs; and (b) the kernel has not uniformly initialized the contents of shared memory to zeroes. Still, it seems a tad risky, and we can easily remove any risk by pre-zeroing the request array for ourselves. In addition to that, we need to establish a coding rule that struct RelFileNode can't contain any padding bytes, since such structs are copied into the request array verbatim. (There are other places that are assuming this anyway, it turns out.) In 9.1 and up, the risk was a bit larger because we were also effectively assuming that struct RelFileNodeBackend contained no pad bytes, and with fields of different types in there, that would be much easier to break. However, there is no good reason to ever transmit fsync or delete requests for temp files to the bgwriter/checkpointer, so we can revert the request structs to plain RelFileNode, getting rid of the padding risk and saving some marginal number of bytes and cycles in fsync queue manipulation while we are at it. The savings might be more than marginal during deletion of a temp relation, because the old code transmitted an entirely useless but nonetheless expensive-to-process ForgetRelationFsync request to the background process, and also had the background process perform the file deletion even though that can safely be done immediately. In addition, make some cleanup of nearby comments and small improvements to the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.	2012-07-17 16:56:54 -04:00
Bruce Momjian	927d61eeff	Run pgindent on 9.2 source tree in preparation for first 9.3 commit-fest.	2012-06-10 15:20:04 -04:00
Tom Lane	ece01aae47	Scan the buffer pool just once, not once per fork, during relation drop. This provides a speedup of about 4X when NBuffers is large enough. There is also a useful reduction in sinval traffic, since we only do CacheInvalidateSmgr() once not once per fork. Simon Riggs, reviewed and somewhat revised by Tom Lane	2012-06-07 17:43:11 -04:00
Tom Lane	e8d029a30b	Do unlocked prechecks in bufmgr.c loops that scan the whole buffer pool. DropRelFileNodeBuffers, DropDatabaseBuffers, FlushRelationBuffers, and FlushDatabaseBuffers have to scan the whole shared_buffers pool because we have no index structure that would find the target buffers any more efficiently than that. This gets expensive with large NBuffers. We can shave some cycles from these loops by prechecking to see if the current buffer is interesting before we acquire the buffer header lock. Ordinarily such a test would be unsafe, but in these cases it should be safe because we are already assuming that the caller holds a lock that prevents any new target pages from being loaded into the buffer pool concurrently. Therefore, no buffer tag should be changing to a value of interest, only away from a value of interest. So a false negative match is impossible, while a false positive is safe because we'll recheck after acquiring the buffer lock. Initial testing says that this speeds these loops by a factor of 2X to 3X on common Intel hardware. Patch for DropRelFileNodeBuffers by Jeff Janes (based on an idea of Heikki's); extended to the remaining sequential scans by Tom Lane	2012-06-07 16:46:26 -04:00
Tom Lane	6308ba05a7	Improve control logic for bgwriter hibernation mode. Commit `6d90eaaa89` added a hibernation mode to the bgwriter to reduce the server's idle-power consumption. However, its interaction with the detailed behavior of BgBufferSync's feedback control loop wasn't very well thought out. That control loop depends primarily on the rate of buffer allocation, not the rate of buffer dirtying, so the hibernation mode has to be designed to operate only when no new buffer allocations are happening. Also, the check for whether the system is effectively idle was not quite right and would fail to detect a constant low level of activity, thus allowing the bgwriter to go into hibernation mode in a way that would let the cycle time vary quite a bit, possibly further confusing the feedback loop. To fix, move the wakeup support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode unless no buffer allocations have happened recently. In addition, fix the delaying logic to remove the problem of possibly not responding to signals promptly, which was basically caused by trying to use the process latch's is_set flag for multiple purposes. I can't prove it but I'm suspicious that that hack was responsible for the intermittent "postmaster does not shut down" failures we've been seeing in the buildfarm lately. In any case it did nothing to improve the readability or robustness of the code. In passing, express the hibernation sleep time as a multiplier on BgWriterDelay, not a constant. I'm not sure whether there's any value in exposing the longer sleep time as an independently configurable setting, but we can at least make it act like this for little extra code.	2012-05-09 23:37:10 -04:00
Tom Lane	1dd89eadcd	Rename I/O timing statistics columns to blk_read_time and blk_write_time. This seems more consistent with the pre-existing choices for names of other statistics columns. Rename assorted internal identifiers to match.	2012-04-29 18:13:33 -04:00
Tom Lane	309c64745e	Rename track_iotiming GUC to track_io_timing. This spelling seems significantly more readable to me.	2012-04-29 16:23:54 -04:00
Robert Haas	ab77b2da8b	Fix incorrect comment in SetBufferCommitInfoNeedsSave(). Noah Misch spotted the fact that the old comment is in fact incorrect, due to memory ordering hazards.	2012-04-18 10:55:40 -04:00
Robert Haas	644828908f	Expose track_iotiming data via the statistics collector. Ants Aasma's original patch to add timing information for buffer I/O requests exposed this data at the relation level, which was judged too costly. I've here exposed it at the database level instead.	2012-04-05 11:40:24 -04:00
Robert Haas	40b9b95769	New GUC, track_iotiming, to track I/O timings. Currently, the only way to see the numbers this gathers is via EXPLAIN (ANALYZE, BUFFERS), but the plan is to add visibility through the stats collector and pg_stat_statements in subsequent patches. Ants Aasma, reviewed by Greg Smith, with some further changes by me.	2012-03-27 14:55:02 -04:00
Robert Haas	2254367435	Make EXPLAIN (BUFFERS) track blocks dirtied, as well as those written. Also expose the new counters through pg_stat_statements. Patch by me. Review by Fujii Masao and Greg Smith.	2012-02-22 20:33:05 -05:00
Heikki Linnakangas	6d90eaaa89	Make bgwriter sleep longer when it has no work to do, to save electricity. To make it wake up promptly when activity starts again, backends nudge it by setting a latch in MarkBufferDirty(). The latch is kept set while bgwriter is active, so there is very little overhead from that when the system is busy. It is only armed before going into longer sleep. Peter Geoghegan, with some changes by me.	2012-01-26 18:39:13 +02:00
Robert Haas	7e4911b2ae	Fix variable confusion in BufferSync(). As noted by Heikki Linnakangas, the previous coding confused the "flags" variable with the "mask" variable. The affect of this appears to be that unlogged buffers would get written out at every checkpoint rather than only at shutdown time. Although that's arguably an acceptable failure mode, I'm back-patching this change, since it seems like a poor idea to rely on this happening to work.	2012-01-06 08:35:48 -05:00
Bruce Momjian	e126958c2e	Update copyright notices for year 2012.	2012-01-01 18:01:58 -05:00
Alvaro Herrera	9d3b502443	Improve logging of autovacuum I/O activity This adds some I/O stats to the logging of autovacuum (when the operation takes long enough that log_autovacuum_min_duration causes it to be logged), so that it is easier to tune. Notably, it adds buffer I/O counts (hits, misses, dirtied) and read and write rate. Authors: Greg Smith and Noah Misch	2011-11-25 16:34:32 -03:00
Tom Lane	40d35036bb	Avoid floating-point underflow while tracking buffer allocation rate. When the system is idle for awhile after activity, the "smoothed_alloc" state variable in BgBufferSync converges slowly to zero. With standard IEEE float arithmetic this results in several iterations with denormalized values, which causes kernel traps and annoying log messages on some poorly-designed platforms. There's no real need to track such small values of smoothed_alloc, so we can prevent the kernel traps by forcing it to zero as soon as it's too small to be interesting for our purposes. This issue is purely cosmetic, since the iterations don't happen fast enough for the kernel traps to pose any meaningful performance problem, but still it seems worth shutting up the log messages. The kernel log messages were previously reported by a number of people, but kudos to Greg Matthews for tracking down exactly where they were coming from.	2011-11-19 00:35:29 -05:00
Simon Riggs	806a2aee37	Split work of bgwriter between 2 processes: bgwriter and checkpointer. bgwriter is now a much less important process, responsible for page cleaning duties only. checkpointer is now responsible for checkpoints and so has a key role in shutdown. Later patches will correct doc references to the now old idea that bgwriter performs checkpoints. Has beneficial effect on performance at high write rates, but mainly refactoring to more easily allow changes for power reduction by simplifying previously tortuous code around required to allow page cleaning and checkpointing to time slice in the same process. Patch by me, Review by Dickson Guedes	2011-11-01 17:14:47 +00:00
Robert Haas	53f1ca59b5	Allow hint bits to be set sooner for temporary and unlogged tables. We need not wait until the commit record is durably on disk, because in the event of a crash the page we're updating with hint bits will be gone anyway. Per off-list report from Heikki Linnakangas, this can significantly degrade the performance of unlogged tables; I was able to show a 2x speedup from this patch on a pgbench run with scale factor 15. In practice, this will mostly help small, heavily updated tables, because on larger tables you're unlikely to run into the same row again before the commit record makes it out to disk.	2011-10-28 17:08:09 -04:00
Tom Lane	a7801b62f2	Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.	2011-09-09 13:23:41 -04:00
Tom Lane	2e53bd5517	Fix incorrect initialization of ProcGlobal->startupBufferPinWaitBufId. It was initialized in the wrong place and to the wrong value. With bad luck this could result in incorrect query-cancellation failures in hot standby sessions, should a HS backend be holding pin on buffer number 1 while trying to acquire a lock.	2011-08-02 13:23:52 -04:00
Peter Eisentraut	21f1e15aaf	Unify spelling of "canceled", "canceling", "cancellation" We had previously (`af26857a27`) established the U.S. spellings as standard.	2011-06-29 09:28:46 +03:00
Peter Eisentraut	8a8fbe7e79	Capitalization fixes	2011-06-19 00:37:30 +03:00
Alvaro Herrera	fba105b109	Use "transient" files for blind writes, take 2 "Blind writes" are a mechanism to push buffers down to disk when evicting them; since they may belong to different databases than the one a backend is connected to, the backend does not necessarily have a relation to link them to, and thus no way to blow them away. We were keeping those files open indefinitely, which would cause a problem if the underlying table was deleted, because the operating system would not be able to reclaim the disk space used by those files. To fix, have bufmgr mark such files as transient to smgr; the lower layer is allowed to close the file descriptor when the current transaction ends. We must be careful to have any other access of the file to remove the transient markings, to prevent unnecessary expensive system calls when evicting buffers belonging to our own database (which files we're likely to require again soon.) This commit fixes a bug in the previous one, which neglected to cleanly handle the LRU ring that fd.c uses to manage open files, and caused an unacceptable failure just before beta2 and was thus reverted.	2011-06-10 13:43:02 -04:00
Alvaro Herrera	9261557eb1	Revert "Use "transient" files for blind writes" This reverts commit `54d9e8c6c1`, which caused a failure on the buildfarm. Not a good thing to have just before a beta release.	2011-06-09 16:41:44 -04:00
Alvaro Herrera	54d9e8c6c1	Use "transient" files for blind writes "Blind writes" are a mechanism to push buffers down to disk when evicting them; since they may belong to different databases than the one a backend is connected to, the backend does not necessarily have a relation to link them to, and thus no way to blow them away. We were keeping those files open indefinitely, which would cause a problem if the underlying table was deleted, because the operating system would not be able to reclaim the disk space used by those files. To fix, have bufmgr mark such files as transient to smgr; the lower layer is allowed to close the file descriptor when the current transaction ends. We must be careful to have any other access of the file to remove the transient markings, to prevent unnecessary expensive system calls when evicting buffers belonging to our own database (which files we're likely to require again soon.)	2011-06-09 16:25:49 -04:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Bruce Momjian	5d950e3b0c	Stamp copyrights for year 2011.	2011-01-01 13:18:15 -05:00
Robert Haas	53dbc27c62	Support unlogged tables. The contents of an unlogged table are WAL-logged; thus, they are not available on standby servers and are truncated whenever the database system enters recovery. Indexes on unlogged tables are also unlogged. Unlogged GiST indexes are not currently supported.	2010-12-29 06:48:53 -05:00
Robert Haas	5f7b58fad8	Generalize concept of temporary relations to "relation persistence". This commit replaces pg_class.relistemp with pg_class.relpersistence; and also modifies the RangeVar node type to carry relpersistence rather than istemp. It also removes removes rd_istemp from RelationData and instead performs the correct computation based on relpersistence. For clarity, we add three new macros: RelationNeedsWAL(), RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we can clarify the purpose of each check that previous depended on rd_istemp. This is intended as infrastructure for the upcoming unlogged tables patch, as well as for future possible work on global temporary tables.	2010-12-13 12:34:26 -05:00
Robert Haas	c2281ac87c	Remove belt-and-suspenders guards against buffer pin leaks. Forcibly releasing all leftover buffer pins should be unnecessary now that we have a robust ResourceOwner mechanism, and it significantly increases the cost of process shutdown. Instead, in an assert-enabled build, assert that no pins are held; in a non-assert-enabled build, do nothing.	2010-11-25 00:06:46 -05:00
Magnus Hagander	9f2e211386	Remove cvs keywords from all files.	2010-09-20 22:08:53 +02:00
Robert Haas	a481ff71af	Remove the isLocalBuf argument from ReadBuffer_common. Since an SMgrRelation now knows whether or not the underlying relation is temporary, there's no point in also passing that information via an additional argument.	2010-08-20 01:07:50 +00:00
Robert Haas	27f145a40e	Further dtrace adjustments for the backend-IDs-in-relpath patch. Update the documentation, and back out a few ill-considered changes whose folly I failed to realize for failure to read the documentation.	2010-08-14 02:22:10 +00:00
Robert Haas	105d4c5ffe	Fix assorted dtrace breakage caused by patch to include backend IDs in temp relpaths. Per buildfarm.	2010-08-13 22:54:17 +00:00
Robert Haas	debcec7dc3	Include the backend ID in the relpath of temporary relations. This allows us to reliably remove all leftover temporary relation files on cluster startup without reference to system catalogs or WAL; therefore, we no longer include temporary relations in XLOG_XACT_COMMIT and XLOG_XACT_ABORT WAL records. Since these changes require including a backend ID in each SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id field has been reduced from two bytes to one, and the maximum number of connections has been reduced from INT_MAX / 4 to 2^23-1. It would be possible to remove these restrictions by increasing the size of SharedInvalidationMessage by 4 bytes, but right now that doesn't seem like a good trade-off. Review by Jaime Casanova and Tom Lane.	2010-08-13 20:10:54 +00:00
Bruce Momjian	65e806cba1	pgindent run for 9.0	2010-02-26 02:01:40 +00:00
Simon Riggs	959ac58c04	In HS, Startup process sets SIGALRM when waiting for buffer pin. If woken by alarm we send SIGUSR1 to all backends requesting that they check to see if they are blocking Startup process. If so, they throw ERROR/FATAL as for other conflict resolutions. Deadlock stop gap removed. max_standby_delay = -1 option removed to prevent deadlock.	2010-01-23 16:37:12 +00:00
Bruce Momjian	0239800893	Update copyright for the year 2010.	2010-01-02 16:58:17 +00:00
Robert Haas	cddca5ec13	Add an EXPLAIN (BUFFERS) option to show buffer-usage statistics. This patch also removes buffer-usage statistics from the track_counts output, since this (or the global server statistics) is deemed to be a better interface to this information. Itagaki Takahiro, reviewed by Euler Taveira de Oliveira.	2009-12-15 04:57:48 +00:00
Bruce Momjian	d747140279	8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew.	2009-06-11 14:49:15 +00:00
Tom Lane	1b2bb33a54	Add a comment documenting the question of whether PrefetchBuffer should try to protect an already-existing buffer from being evicted. This was left as an open issue when the posix_fadvise patch was committed. I'm not sure there's any evidence to justify more work in this area, but we should have some record about it in the source code.	2009-04-03 18:17:43 +00:00
Tom Lane	948d6ec90f	Modify the relcache to record the temp status of both local and nonlocal temp relations; this is no more expensive than before, now that we have pg_class.relistemp. Insert tests into bufmgr.c to prevent attempting to fetch pages from nonlocal temp relations. This provides a low-level defense against bugs-of-omission allowing temp pages to be loaded into shared buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop. While at it, tweak a bunch of places to use new relcache tests (instead of expensive probes into pg_namespace) to detect local or nonlocal temp tables.	2009-03-31 22:12:48 +00:00
Tom Lane	471913a6a5	More fixes for 8.4 DTrace probes. Remove useless BUFFER_HIT/BUFFER_MISS probes --- the BUFFER_READ_DONE probe provides the same information and more besides. Expand the LOCK_WAIT_START/DONE probe arguments so that there's actually some chance of telling what is being waited for. Update and clean up the documentation.	2009-03-23 01:52:38 +00:00
Tom Lane	44023dc5f5	Add isExtend to the parameters of the buffer_read_start and buffer_read_done DTrace probes, so that ordinary reads can be distinguished from relation extension operations. Move buffer_read_start probe to before the smgrnblocks() call that's needed in the isExtend case, since really that step should be charged as part of the time needed for the extension operation. (This makes it slightly harder to match the read_start with the associated read_done, since now you can't match them on blockNumber, but it should still be possible since isExtend operations on the same relation can never be interleaved.) Per recent discussion. In passing, add the page identity (forkNum/blockNum) to the parameters of the buffer_flush_start/buffer_flush_done probes, which were unaccountably lacking the info.	2009-03-22 22:39:05 +00:00
Tom Lane	d287c9eff0	Restore previous ordering of BUFFER_FLUSH_START probe. I had wanted to make it include the time for the possible smgropen() call, but that results in a null pointer dereference :-(. An alternative solution would be to fetch the buffer tag instead of looking at *reln, but I'll just put it back as it was for the moment. BTW, this indicates that DTrace probes evaluate their arguments even when nominally inactive. What was that about "zero cost", again?	2009-03-13 17:46:21 +00:00
Tom Lane	e04810e8c4	Code review for dtrace probes added (so far) to 8.4. Adjust placement of some bufmgr probes, take out redundant and memory-leak-inducing path arguments to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to recalculate space used in sort__done, clean up formatting in places where I'm not sure pgindent will do a nice job by itself.	2009-03-11 23:19:25 +00:00
Tom Lane	b7b8f0b609	Implement prefetching via posix_fadvise() for bitmap index scans. A new GUC variable effective_io_concurrency controls how many concurrent block prefetch requests will be issued. (The best way to handle this for plain index scans is still under debate, so that part is not applied yet --- tgl) Greg Stark	2009-01-12 05:10:45 +00:00
Bruce Momjian	511db38ace	Update copyright for 2009.	2009-01-01 17:24:05 +00:00
Bruce Momjian	5a90bc1fbe	The attached patch contains a couple of fixes in the existing probes and includes a few new ones. - Fixed compilation errors on OS X for probes that use typedefs - Fixed a number of probes to pass ForkNumber per the relation forks patch - The new probes are those that were taken out from the previous submitted patch and required simple fixes. Will submit the other probes that may require more discussion in a separate patch. Robert Lor	2008-12-17 01:39:04 +00:00
Heikki Linnakangas	3396000684	Rethink the way FSM truncation works. Instead of WAL-logging FSM truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To make that cleaner from modularity point of view, move the WAL-logging one level up to RelationTruncate, and move RelationTruncate and all the related WAL-logging to new src/backend/catalog/storage.c file. Introduce new RelationCreateStorage and RelationDropStorage functions that are used instead of calling smgrcreate/smgrscheduleunlink directly. Move the pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new functions. This leaves smgr.c as a thin wrapper around md.c; all the transactional stuff is now in storage.c. This will make it easier to add new forks with similar truncation logic, like the visibility map.	2008-11-19 10:34:52 +00:00
Heikki Linnakangas	7e8b0b9ab1	Change error messages to print the physical path, like "base/11517/3767_fsm", instead of symbolic names like "1663/11517/3767/1", per Alvaro's suggestion. I didn't change the messages in the higher-level index, heap and FSM routines, though, where the fork is implicit.	2008-11-11 13:19:16 +00:00
Heikki Linnakangas	19c8dc839b	Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer functions into one ReadBufferExtended function, that takes the strategy and mode as argument. There's three modes, RBM_NORMAL which is the default used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages without throwing an error. The FSM needs the new mode to recover from corrupt pages, which could happend if we crash after extending an FSM file, and the new page is "torn". Add fork number to some error messages in bufmgr.c, that still lacked it.	2008-10-31 15:05:00 +00:00
Alvaro Herrera	089ae3bc9a	Properly access a buffer's LSN using existing access macros instead of abusing knowledge of page layout. Stolen from Jonah Harris' CRC patch	2008-10-20 21:11:15 +00:00
Tom Lane	35c2a3c3cf	Allow ShowBufferUsage() to report the number of reads/writes that have occurred to temporary files. This replaces the unused NDirectFileRead/NDirectFileWrite counters. Itagaki Takahiro	2008-09-17 13:15:55 +00:00
Heikki Linnakangas	3f0e808c4a	Introduce the concept of relation forks. An smgr relation can now consist of multiple forks, and each fork can be created and grown separately. The bulk of this patch is about changing the smgr API to include an extra ForkNumber argument in every smgr function. Also, smgrscheduleunlink and smgrdounlink no longer implicitly call smgrclose, because other forks might still exist after unlinking one. The callers of those functions have been modified to call smgrclose instead. This patch in itself doesn't have any user-visible effect, but provides the infrastructure needed for upcoming patches. The additional forks envisioned are a rewritten FSM implementation that doesn't rely on a fixed-size shared memory block, and a visibility map to allow skipping portions of a table in VACUUM that have no dead tuples.	2008-08-11 11:05:11 +00:00
Tom Lane	d8b04d5fac	In ReadOrZeroBuffer (and related entry points), don't bother to call PageHeaderIsValid when we zero the buffer instead of reading the page in. The actual performance improvement is probably marginal since this function isn't very heavily used, but a cycle saved is a cycle earned. Zdenek Kotala	2008-08-05 15:09:04 +00:00
Alvaro Herrera	e36e6b1cab	Add a few more DTrace probes to the backend. Robert Lor	2008-08-01 13:16:09 +00:00
Tom Lane	9d035f4254	Clean up the use of some page-header-access macros: principally, use SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that makes the code clearer, and avoid casting between Page and PageHeader where possible. Zdenek Kotala, with some additional cleanup by Heikki Linnakangas. I did not apply the parts of the proposed patch that would have resulted in slightly changing the on-disk format of hash indexes; it seems to me that's not a win as long as there's any chance of having in-place upgrade for 8.4.	2008-07-13 20:45:47 +00:00
Alvaro Herrera	a3540b0f65	Improve our #include situation by moving pointer types away from the corresponding struct definitions. This allows other headers to avoid including certain highly-loaded headers such as rel.h and relscan.h, instead using just relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less unnecessary dependencies.	2008-06-19 00:46:06 +00:00
Heikki Linnakangas	a213f1ee6c	Refactor XLogOpenRelation() and XLogReadBuffer() in preparation for relation forks. XLogOpenRelation() and the associated light-weight relation cache in xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument, instead of Relation. For functions that still need a Relation struct during WAL replay, there's a new function called CreateFakeRelcacheEntry() that returns a fake entry like XLogOpenRelation() used to.	2008-06-12 09:12:31 +00:00
Alvaro Herrera	cc87402d6e	Move BufferGetPageSize and BufferGetPage from bufpage.h to bufmgr.h. It is more logical that way, and also it reduces the amount of unnecessary includes in bufpage.h, which is widely used. Zdenek Kotala. My previous patch to bufpage.h should also have credited him as author, but I forgot (sorry about that).	2008-06-08 22:00:48 +00:00
Alvaro Herrera	9084399782	Put back bufmgr.h in bufpage.h -- it is needed by some macros. Remove #include bufmgr.h from (most?) source files which already include bufpage.h.	2008-05-12 16:06:10 +00:00
Alvaro Herrera	f8c4d7db60	Restructure some header files a bit, in particular heapam.h, by removing some unnecessary #include lines in it. Also, move some tuple routine prototypes and macros to htup.h, which allows removal of heapam.h inclusion from some .c files. For this to work, a new header file access/sysattr.h needed to be created, initially containing attribute numbers of system columns, for pg_dump usage. While at it, make contrib ltree, intarray and hstore header files more consistent with our header style.	2008-05-12 00:00:54 +00:00
Bruce Momjian	9098ab9e32	Update copyrights in source tree to 2008.	2008-01-01 19:46:01 +00:00
Bruce Momjian	fdf5a5efb7	pgindent run for 8.3.	2007-11-15 21:14:46 +00:00
Tom Lane	7a315a09dc	Dept. of second thoughts: fix loop in BgBufferSync so that the exit when bgwriter_lru_maxpages is exceeded leaves the loop variables in the expected state. In the original coding, we'd fail to advance next_to_clean, causing that buffer to be probably-uselessly rechecked next time, and also have an off-by-one idea of the number of buffers scanned.	2007-09-25 22:11:48 +00:00
Tom Lane	6f5c38dcd0	Just-in-time background writing strategy. This code avoids re-scanning buffers that cannot possibly need to be cleaned, and estimates how many buffers it should try to clean based on moving averages of recent allocation requests and density of reusable buffers. The patch also adds a couple more columns to pg_stat_bgwriter to help measure the effectiveness of the bgwriter. Greg Smith, building on his own work and ideas from several other people, in particular a much older patch from Itagaki Takahiro.	2007-09-25 20:03:38 +00:00
Tom Lane	282d2a03dd	HOT updates. When we update a tuple without changing any of its indexed columns, and the new version can be stored on the same heap page, we no longer generate extra index entries for the new version. Instead, index searches follow the HOT-chain links to ensure they find the correct tuple version. In addition, this patch introduces the ability to "prune" dead tuples on a per-page basis, without having to do a complete VACUUM pass to recover space. VACUUM is still needed to clean up dead index entries, however. Pavan Deolasee, with help from a bunch of other people.	2007-09-20 17:56:33 +00:00
Tom Lane	9fc25c0511	Improve logging of checkpoints. Patch by Greg Smith, worked over by Heikki and a little bit by me.	2007-06-30 19:12:02 +00:00
Tom Lane	867e2c91a0	Implement "distributed" checkpoints in which the checkpoint I/O is spread over a fairly long period of time, rather than being spat out in a burst. This happens only for background checkpoints carried out by the bgwriter; other cases, such as a shutdown checkpoint, are still done at full speed. Remove the "all buffers" scan in the bgwriter, and associated stats infrastructure, since this seems no longer very useful when the checkpoint itself is properly throttled. Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas, and some minor API editorialization by me.	2007-06-28 00:02:40 +00:00
Tom Lane	de6a6383a7	Update obsolete comment: it's no longer the case that mdread() will allow reads beyond EOF, except by special coercion.	2007-06-18 00:47:20 +00:00
Tom Lane	d526575f89	Make large sequential scans and VACUUMs work in a limited-size "ring" of buffers, rather than blowing out the whole shared-buffer arena. Aside from avoiding cache spoliation, this fixes the problem that VACUUM formerly tended to cause a WAL flush for every page it modified, because we had it hacked to use only a single buffer. Those flushes will now occur only once per ring-ful. The exact ring size, and the threshold for seqscans to switch into the ring usage pattern, remain under debate; but the infrastructure seems done. The key bit of infrastructure is a new optional BufferAccessStrategy object that can be passed to ReadBuffer operations; this replaces the former StrategyHintVacuum API. This patch also changes the buffer usage-count methodology a bit: we now advance usage_count when first pinning a buffer, rather than when last unpinning it. To preserve the behavior that a buffer's lifetime starts to decrease when it's released, the clock sweep code is modified to not decrement usage_count of pinned buffers. Work not done in this commit: teach GiST and GIN indexes to use the vacuum BufferAccessStrategy for vacuum-driven fetches. Original patch by Simon, reworked by Heikki and again by Tom.	2007-05-30 20:12:03 +00:00
Tom Lane	77947c51c0	Fix up pgstats counting of live and dead tuples to recognize that committed and aborted transactions have different effects; also teach it not to assume that prepared transactions are always committed. Along the way, simplify the pgstats API by tying counting directly to Relations; I cannot detect any redeeming social value in having stats pointers in HeapScanDesc and IndexScanDesc structures. And fix a few corner cases in which counts might be missed because the relation's pgstat_info pointer hadn't been set.	2007-05-27 03:50:39 +00:00
Tom Lane	63735ca815	Dept. of second thoughts: add comments cautioning against using ReadOrZeroBuffer to fetch pages from beyond physical EOF. This would usually work, but would cause problems for md.c if writes occurred beyond a segment boundary when the previous segment file hadn't been fully extended.	2007-05-02 23:34:48 +00:00
Tom Lane	8c3cc86e7b	During WAL recovery, when reading a page that we intend to overwrite completely from the WAL data, don't bother to physically read it; just have bufmgr.c return a zeroed-out buffer instead. This speeds recovery significantly, and also avoids unnecessary failures when a page-to-be-overwritten has corrupt page headers on disk. This replaces a former kluge that accomplished the latter by pretending zero_damaged_pages was always ON during WAL recovery; which was OK when the kluge was put in, but is unsafe when restoring a WAL log that was written with full_page_writes off. Heikki Linnakangas	2007-05-02 23:18:03 +00:00
Magnus Hagander	335feca441	Add some instrumentation to the bgwriter, through the stats collector. New view pg_stat_bgwriter, and the functions required to build it.	2007-03-30 18:34:56 +00:00
Bruce Momjian	8b4ff8b6a1	Wording cleanup for error messages. Also change can't -> cannot. Standard English uses "may", "can", and "might" in different ways: may - permission, "You may borrow my rake." can - ability, "I can lift that log." might - possibility, "It might rain today." Unfortunately, in conversational English, their use is often mixed, as in, "You may use this variable to do X", when in fact, "can" is a better choice. Similarly, "It may crash" is better stated, "It might crash".	2007-02-01 19:10:30 +00:00
Bruce Momjian	29dccf5fe0	Update CVS HEAD for 2007 copyright. Back branches are typically not back-stamped for this.	2007-01-05 22:20:05 +00:00
Tom Lane	954c1813ac	Remove an unnecessary HOLD_INTERRUPTS/RESUME_INTERRUPTS pair. This was required back when RESUME_INTERRUPTS could actually execute ProcessInterrupts, but that hasn't been true since 2001...	2006-10-22 20:34:54 +00:00
Bruce Momjian	f99a569a2e	pgindent run for 8.2.	2006-10-04 00:30:14 +00:00
Tom Lane	ffae5cc5a6	Add a check to prevent overwriting valid data if smgrnblocks() gives a wrong answer, as has been seen to occur with a buggy Linux kernel. Not really our bug, but it's a simple test in a seldom-used control path, so might as well have a defense.	2006-09-25 22:01:10 +00:00
Tom Lane	2e5e856f6b	Marginal cleanup in arrangements for ensuring StrategyHintVacuum is cleared after an error during VACUUM. We have a PG_TRY block anyway around the only call sites, so just reset it in the CATCH clause instead of having AtEOXact_Buffers blindly do it during xact end. I think the old code was actively wrong for the case of a failure during ANALYZE inside a subtransaction --- the flag wouldn't get cleared until main transaction end. Probably not worth back-patching though.	2006-09-17 22:16:22 +00:00
Tom Lane	10b9ca3d05	Split the buffer mapping table into multiple separately lockable partitions, as per discussion. Passes functionality checks, but I don't have any performance data yet.	2006-07-23 03:07:58 +00:00
Bruce Momjian	e0522505bd	Remove 576 references of include files that were not needed.	2006-07-14 14:52:27 +00:00
Tom Lane	0fcc3c2f1d	Repair a low-probability race condition identified by Qingqing Zhou. If a process abandons a wait in LockBufferForCleanup (in practice, only happens if someone cancels a VACUUM) just before someone else sends it a signal indicating the buffer is available, it was possible for the wakeup to remain in the process' semaphore, causing misbehavior next time the process waited for an lmgr lock. Rather than try to prevent the race condition directly, it seems best to make the lock manager robust against leftover wakeups, by having it repeat waiting on the semaphore if the lock has not actually been granted or denied yet.	2006-04-14 03:38:56 +00:00
Tom Lane	a8b8f4db23	Clean up WAL/buffer interactions as per my recent proposal. Get rid of the misleadingly-named WriteBuffer routine, and instead require routines that change buffer pages to call MarkBufferDirty (which does exactly what it says). We also require that they do so before calling XLogInsert; this takes care of the synchronization requirement documented in SyncOneBuffer. Note that because bufmgr takes the buffer content lock (in shared mode) while writing out any buffer, it doesn't matter whether MarkBufferDirty is executed before the buffer content change is complete, so long as the content change is completed before releasing exclusive lock on the buffer. So it's OK to set the dirtybit before we fill in the LSN. This eliminates the former kluge of needing to set the dirtybit in LockBuffer. Aside from making the code more transparent, we can also add some new debugging assertions, in particular that the caller of MarkBufferDirty must hold the buffer content lock, not merely a pin.	2006-03-31 23:32:07 +00:00
Tom Lane	6d61cdec07	Clean up and document the API for XLogOpenRelation and XLogReadBuffer. This commit doesn't make much functional change, but it does eliminate some duplicated code --- for instance, PageIsNew tests are now done inside XLogReadBuffer rather than by each caller. The GIST xlog code still needs a lot of love, but I'll worry about that separately.	2006-03-29 21:17:39 +00:00
Bruce Momjian	f2f5b05655	Update copyright for 2006. Update scripts.	2006-03-05 15:59:11 +00:00
Tom Lane	9a506a6257	Arrange to call AbsorbFsyncRequests every so often while performing a checkpoint in the bgwriter. This forestalls overflow of the fsync request queue, which is not fatal but causes considerable performance degradation when it occurs (because backends then have to do their own fsyncs). Per patch from Itagaki Takahiro, modified a little bit by me.	2006-03-03 00:02:02 +00:00
Tom Lane	304160c3e2	Fix ReadBuffer() to correctly handle the case where it's trying to extend the relation but it finds a pre-existing valid buffer. The buffer does not correspond to any page known to the kernel, so we must do smgrextend to ensure that the space becomes allocated. The 7.x branches all do this correctly, but the corner case got lost somewhere during 8.0 bufmgr rewrites. (My fault no doubt :-( ... I think I assumed that such a buffer must be not-BM_VALID, which is not so.)	2006-01-06 00:04:20 +00:00
Tom Lane	195f164228	Get rid of the SpinLockAcquire/SpinLockAcquire_NoHoldoff distinction in favor of having just one set of macros that don't do HOLD/RESUME_INTERRUPTS (hence, these correspond to the old SpinLockAcquire_NoHoldoff case). Given our coding rules for spinlock use, there is no reason to allow CHECK_FOR_INTERRUPTS to be done while holding a spinlock, and also there is no situation where ImmediateInterruptOK will be true while holding a spinlock. Therefore doing HOLD/RESUME_INTERRUPTS while taking/releasing a spinlock is just a waste of cycles. Qingqing Zhou and Tom Lane.	2005-12-29 18:08:05 +00:00
Bruce Momjian	436a2956d8	Re-run pgindent, fixing a problem where comment lines after a blank comment line where output as too long, and update typedefs for /lib directory. Also fix case where identifiers were used as variable names in the backend, but as typedefs in ecpg (favor the backend for indenting). Backpatch to 8.1.X.	2005-11-22 18:17:34 +00:00
Tom Lane	c859308aba	DropRelFileNodeBuffers failed to fix the state of the lookup hash table that was added to localbuf.c in 8.1; therefore, applying it to a temp table left corrupt lookup state in memory. The only case where this had a significant chance of causing problems was an ON COMMIT DELETE ROWS temp table; the other possible paths left bogus state that was unlikely to be used again. Per report from Csaba Nagy.	2005-11-17 17:42:02 +00:00
Tom Lane	fbbe00242d	Tweak buffer manager so that 'internal' accesses to a buffer do not advance its usage_count. This includes writes of dirty buffers triggered by bgwriter, checkpoint, or FlushRelationBuffers, as well as various corner cases that really ought not count as accesses to the page. Should make for some marginal improvement in the quality of our decisions about when to recycle buffers. Per suggestion from ITAGAKI Takahiro.	2005-10-27 17:07:58 +00:00
Bruce Momjian	1dc3498251	Standard pgindent run for 8.1.	2005-10-15 02:49:52 +00:00
Tom Lane	07eeb9d109	Do all accesses to shared buffer headers through volatile-qualified pointers, to ensure that compilers won't rearrange accesses to occur while we're not holding the buffer header spinlock. It's probably not necessary to mark volatile in every single place in bufmgr.c, but better safe than sorry. Per trouble report from Kevin Grittner.	2005-10-12 16:45:14 +00:00
Bruce Momjian	27639809d2	Reverse out Assert addition.	2005-08-12 23:13:54 +00:00
Bruce Momjian	fab177e64f	Improve documention on loading large data sets into plperl. David Fetter	2005-08-12 21:42:53 +00:00
Tom Lane	3ae7e4a33b	Remove BufferBlockPointers array in favor of a base + (bufnum) * BLCKSZ computation. On modern machines this is as fast if not faster, and we don't have to clog the CPU's L2 cache with a tens-of-KB pointer array. If we ever decide to adopt a more dynamic allocation method for shared buffers, we'll probably have to revert this patch, but in the meantime we might as well save a few bytes and nanoseconds. Per Qingqing Zhou.	2005-08-12 05:05:51 +00:00
Tom Lane	15269b5955	Avoid useless loop overhead in AtEOXact routines when the backend is compiled with USE_ASSERT_CHECKING but is running with assert_enabled false.	2005-08-08 19:44:22 +00:00
Tom Lane	7117cd3a77	Cause ShutdownPostgres to do a normal transaction abort during backend exit, instead of trying to take shortcuts. Introduce some additional shutdown callback routines to eliminate kluges like having ProcKill be responsible for shutting down the buffer manager. Ensure that the order of operations during shutdown is predictable and what you would expect given the module layering.	2005-08-08 03:12:16 +00:00
Tom Lane	6eac4e69cf	Tweak BgBufferSync() so that a persistent write error on a dirty buffer doesn't block the bgwriter from making progress writing out other buffers. This was a hard problem in the context of the ARC/2Q design, but it's trivial in the context of clock sweep ... just advance the sweep counter before we try to write not after.	2005-08-02 20:52:08 +00:00
Tom Lane	ee3b71f6bc	Split the shared-memory array of PGPROC pointers out of the sinval communication structure, and make it its own module with its own lock. This should reduce contention at least a little, and it definitely makes the code seem cleaner. Per my recent proposal.	2005-05-19 21:35:48 +00:00
Tom Lane	354049c709	Remove unnecessary calls of FlushRelationBuffers: there is no need to write out data that we are about to tell the filesystem to drop. smgr_internal_unlink already had a DropRelFileNodeBuffers call to get rid of dead buffers without a write after it's no longer possible to roll back the deleting transaction. Adding a similar call in smgrtruncate simplifies callers and makes the overall division of labor clearer. This patch removes the former behavior that VACUUM would write all dirty buffers of a relation unconditionally.	2005-03-20 22:00:54 +00:00
Tom Lane	cef01c3355	Avoid infinite loop in InvalidateBuffer if we ourselves are holding a pin on the victim buffer.	2005-03-18 05:25:23 +00:00
Tom Lane	5d5087363d	Replace the BufMgrLock with separate locks on the lookup hashtable and the freelist, plus per-buffer spinlocks that protect access to individual shared buffer headers. This requires abandoning a global freelist (since the freelist is a global contention point), which shoots down ARC and 2Q as well as plain LRU management. Adopt a clock sweep algorithm instead. Preliminary results show substantial improvement in multi-backend situations.	2005-03-04 20:21:07 +00:00
Tom Lane	0ce4d56924	Phase 1 of fix for 'SMgrRelation hashtable corrupted' problem. This is the minimum required fix. I want to look next at taking advantage of it by simplifying the message semantics in the shared inval message queue, but that part can be held over for 8.1 if it turns out too ugly.	2005-01-10 20:02:24 +00:00
Tom Lane	c9d8edc906	Repair bufmgr deadlock problem reported by Michael Wildpaner. Must take share lock on a buffer being written out before releasing BufMgrLock in the BufferAlloc code path; if we do it later we might block on someone who's re-pinned the buffer. I believe this is only an issue for BufferAlloc and not the other places that call FlushBuffer. BufferSync must continue to do it the old way since it may well be trying to write buffers that other backends have pinned; but it should not be holding any conflicting locks. FlushRelationBuffers is okay since it's got exclusive lock at the relation level.	2005-01-03 18:49:41 +00:00
PostgreSQL Daemon	2ff501590b	Tag appropriate files for rc3 Also performed an initial run through of upgrading our Copyright date to extend to 2005 ... first run here was very simple ... change everything where: grep 1996-2004 && the word 'Copyright' ... scanned through the generated list with 'less' first, and after, to make sure that I only picked up the right entries ...	2004-12-31 22:04:05 +00:00
Neil Conway	4acc97d7e4	Assert that BufferIsPinned() in IncrBufferRefCount(), rather than using a home-brewed combination of assertions that boiled down to the same thing.	2004-11-24 02:56:17 +00:00
Tom Lane	4347cc2392	Allow background writing to be shut down by setting limit values to zero. This does not disable the bgwriter process: it still has to wake up often enough to collect fsync requests from backends in a timely fashion. But it responds to the recent gripe about not being able to prevent the disk from being spun up constantly.	2004-10-17 22:01:51 +00:00
Tom Lane	fdd13f1568	Give the ResourceOwner mechanism full responsibility for releasing buffer pins at end of transaction, and reduce AtEOXact_Buffers to an Assert cross-check that this was done correctly. When not USE_ASSERT_CHECKING, AtEOXact_Buffers is a complete no-op. This gets rid of an O(NBuffers) bottleneck during transaction commit/abort, which recent testing has shown becomes significant above a few tens of thousands of shared buffers.	2004-10-16 18:57:26 +00:00
Tom Lane	1c2de47746	Remove BufferLocks[] array in favor of a single pointer to the buffer (if any) currently waited for by LockBufferForCleanup(), which is all that we were using it for anymore. Saves some space and eliminates proportional-to-NBuffers slowdown in UnlockBuffers().	2004-10-16 18:05:07 +00:00
Tom Lane	9ffc8ed58b	Repair possible failure to update hint bits back to disk, per http://archives.postgresql.org/pgsql-hackers/2004-10/msg00464.php. This fix is intended to be permanent: it moves the responsibility for calling SetBufferCommitInfoNeedsSave() into the tqual.c routines, eliminating the requirement for callers to test whether t_infomask changed. Also, tighten validity checking on buffer IDs in bufmgr.c --- several routines were paranoid about out-of-range shared buffer numbers but not about out-of-range local ones, which seems a tad pointless.	2004-10-15 22:40:29 +00:00
Tom Lane	eb917c1a21	I can't see any good reason for DropRelFileNodeBuffers to be issuing FATAL when it detects a nonzero reference count. Reduce to ERROR.	2004-09-06 17:31:32 +00:00
Tom Lane	a421b4e850	FlushRelationBuffers was also being a bit cavalier about whether the relation is already opened by smgr.	2004-08-31 16:13:06 +00:00
Bruce Momjian	b6b71b85bc	Pgindent run for 8.0.	2004-08-29 05:07:03 +00:00
Bruce Momjian	da9a8649d8	Update copyright to 2004.	2004-08-29 04:13:13 +00:00
Tom Lane	fe548629c5	Invent ResourceOwner mechanism as per my recent proposal, and use it to keep track of portal-related resources separately from transaction-related resources. This allows cursors to work in a somewhat sane fashion with nested transactions. For now, cursor behavior is non-subtransactional, that is a cursor's state does not roll back if you abort a subtransaction that fetched from the cursor. We might want to change that later.	2004-07-17 03:32:14 +00:00
Tom Lane	573a71a5da	Nested transactions. There is still much left to do, especially on the performance front, but with feature freeze upon us I think it's time to drive a stake in the ground and say that this will be in 7.5. Alvaro Herrera, with some help from Tom Lane.	2004-07-01 00:52:04 +00:00
Tom Lane	2467394ee1	Tablespaces. Alternate database locations are dead, long live tablespaces. There are various things left to do: contrib dbsize and oid2name modules need work, and so does the documentation. Also someone should think about COMMENT ON TABLESPACE and maybe RENAME TABLESPACE. Also initlocation is dead, it just doesn't know it yet. Gavin Sherry and Tom Lane.	2004-06-18 06:14:31 +00:00
Tom Lane	e6cba71503	Add some code to Assert that when we release pin on a buffer, we are not holding the buffer's cntx_lock or io_in_progress_lock. A recent report from Litao Wu makes me wonder whether it is ever possible for us to drop a buffer and forget to release its cntx_lock. The Assert does not fire in the regression tests, but that proves little ...	2004-06-11 16:43:24 +00:00
Tom Lane	91d20ff7aa	Additional mop-up for sync-to-fsync changes: avoid issuing fsyncs for temp tables, and avoid WAL-logging truncations of temp tables. Do issue fsync on truncated files (not sure this is necessary but it seems like a good idea).	2004-05-31 20:31:33 +00:00
Tom Lane	e674707968	Minor code rationalization: FlushRelationBuffers just returns void, rather than an error code, and does elog(ERROR) not elog(WARNING) when it detects a problem. All callers were simply elog(ERROR)'ing on failure return anyway, and I find it hard to envision a caller that would not, so we may as well simplify the callers and produce the more useful error message directly.	2004-05-31 19:24:05 +00:00
Tom Lane	9b178555fc	Per previous discussions, get rid of use of sync(2) in favor of explicitly fsync'ing every (non-temp) file we have written since the last checkpoint. In the vast majority of cases, the burden of the fsyncs should fall on the bgwriter process not on backends. (To this end, we assume that an fsync issued by the bgwriter will force out blocks written to the same file by other processes using other file descriptors. Anyone have a problem with that?) This makes the world safe for WIN32, which ain't even got sync(2), and really makes the world safe for Unixen as well, because sync(2) never had the semantics we need: it offers no way to wait for the requested I/O to finish. Along the way, fix a bug I recently introduced in xlog recovery: file truncation replay failed to clear bufmgr buffers for the dropped blocks, which could result in 'PANIC: heap_delete_redo: no block' later on in xlog replay.	2004-05-31 03:48:10 +00:00
Tom Lane	076a055acf	Separate out bgwriter code into a logically separate module, rather than being random pieces of other files. Give bgwriter responsibility for all checkpoint activity (other than a post-recovery checkpoint); so this child process absorbs the functionality of the former transient checkpoint and shutdown subprocesses. While at it, create an actual include file for postmaster.c, which for some reason never had its own file before.	2004-05-29 22:48:23 +00:00
Tom Lane	4af3421161	Get rid of rd_nblocks field in relcache entries. Turns out this was costing us lots more to maintain than it was worth. On shared tables it was of exactly zero benefit because we couldn't trust it to be up to date. On temp tables it sometimes saved an lseek, but not often enough to be worth getting excited about. And the real problem was that we forced an lseek on every relcache flush in order to update the field. So all in all it seems best to lose the complexity.	2004-05-08 19:09:25 +00:00
Neil Conway	0370951347	Tiny assorted fixes: correct a typo in a comment in vacuumlazy.c, remove some unused #include directives from bufmgr.c, and clarify comments in bufmgr.h and buf.h	2004-04-25 23:50:58 +00:00
Neil Conway	139abc2896	Make LocalRefCount and PrivateRefCount arrays of int32, rather than long. This saves a small amount of per-backend memory for LP64 machines.	2004-04-22 07:21:55 +00:00
Tom Lane	95a03e9cdf	Another round of code cleanup on bufmgr. Use BM_VALID flag to keep track of whether we have successfully read data into a buffer; this makes the error behavior a bit more transparent (IMHO anyway), and also makes it work correctly for local buffers which don't use Start/TerminateBufferIO. Collapse three separate functions for writing a shared buffer into one. This overlaps a bit with cleanups that Neil proposed awhile back, but seems not to have committed yet.	2004-04-21 18:06:30 +00:00
Tom Lane	011c3e62e7	Code review for ARC patch. Eliminate static variables, improve handling of VACUUM cases so that VACUUM requests don't affect the ARC state at all, avoid corner case where BufferSync would uselessly rewrite a buffer that no longer contains the page that was to be flushed. Make some minor other cleanups in and around the bufmgr as well, such as moving PinBuffer and UnpinBuffer into bufmgr.c where they really belong.	2004-04-19 23:27:17 +00:00
Tom Lane	da99cce7cd	Avoid delaying postmaster shutdown by up to 10 seconds on platforms where signals do not terminate sleep() delays.	2004-02-12 20:07:26 +00:00
Jan Wieck	fc65a3e1fd	Fixed bug where FlushRelationBuffers() did call StrategyInvalidateBuffer() for already empty buffers because their buffer tag was not cleard out when the buffers have been invalidated before. Also removed the misnamed BM_FREE bufhdr flag and replaced the checks, which effectively ask if the buffer is unpinned, with checks against the refcount field. Jan	2004-02-12 15:06:56 +00:00
Tom Lane	58f337a343	Centralize implementation of delay code by creating a pg_usleep() subroutine in src/port/pgsleep.c. Remove platform dependencies from miscadmin.h and put them in port.h where they belong. Extend recent vacuum cost-based-delay patch to apply to VACUUM FULL, ANALYZE, and non-btree index vacuuming. By the way, where is the documentation for the cost-based-delay patch?	2004-02-10 03:42:45 +00:00
Tom Lane	87bd956385	Restructure smgr API as per recent proposal. smgr no longer depends on the relcache, and so the notion of 'blind write' is gone. This should improve efficiency in bgwriter and background checkpoint processes. Internal restructuring in md.c to remove the not-very-useful array of MdfdVec objects --- might as well just use pointers. Also remove the long-dead 'persistent main memory' storage manager (mm.c), since it seems quite unlikely to ever get resurrected.	2004-02-10 01:55:27 +00:00
Jan Wieck	f425b605f4	Cost based vacuum delay feature. Jan	2004-02-06 19:36:18 +00:00
Jan Wieck	8d09e25693	Backing out the background writer sync() option. Jan	2004-02-04 01:24:53 +00:00
Bruce Momjian	5ee2ae2049	Remove sleep() and use single PG_SLEEP call for Win32 signal handling and consistency. Change PG_USLEEP to use SleepEx() for signal interuptability.	2004-01-30 15:57:04 +00:00
Jan Wieck	d77b63b17c	Added GUC variable bgwriter_flush_method controlling the action done by the background writer between writing dirty blocks and napping. none (default) no action sync bgwriter calls smgrsync() causing a sync(2) A global sync() is only good on dedicated database servers, so more flush methods should be added in the future. Jan	2004-01-24 20:00:46 +00:00
Bruce Momjian	38081fd000	Change PG_DELAY from msec to usec and use it consistenly rather than select(). Add Win32 Sleep() for delay.	2004-01-09 21:08:50 +00:00
Neil Conway	192ad63bd7	More janitorial work: remove the explicit casting of NULL literals to a pointer type when it is not necessary to do so. For future reference, casting NULL to a pointer type is only necessary when (a) invoking a function AND either (b) the function has no prototype OR (c) the function is a varargs function.	2004-01-07 18:56:30 +00:00
Tom Lane	16cc9dff4f	bufmgr.c failed to compile on Darwin, because it didn't include <sys/time.h> where struct timeval is defined.	2003-12-20 22:18:02 +00:00
Neil Conway	fef0c8345a	I posted some bufmgr cleanup a few weeks ago, but it conflicted with some concurrent changes Jan was making to the bufmgr. Here's an updated version of the patch -- it should apply cleanly to CVS HEAD and passes the regression tests. This patch makes the following changes: - remove the UnlockAndReleaseBuffer() and UnlockAndWriteBuffer() macros, and replace uses of them with calls to the appropriate functions. - remove a bunch of #ifdef BMTRACE code: it is ugly & broken (i.e. it doesn't compile) - make BufferReplace() return a bool, not an int - cleanup some logic in bufmgr.c; should be functionality equivalent to the previous code, just cleaner now - remove the BM_PRIVATE flag as it is unused - improve a few comments, etc.	2003-12-14 00:34:47 +00:00
Tom Lane	0902ece5b9	Force zero_damaged_pages to be effectively ON during recovery from WAL, since there is no need to worry about damaged pages when we are going to overwrite them anyway from the WAL. Per recent discussion.	2003-12-01 16:53:19 +00:00
PostgreSQL Daemon	969685ad44	$Header: -> $PostgreSQL Changes ...	2003-11-29 19:52:15 +00:00
Tom Lane	0a97cb37fc	Remove unused variable.	2003-11-21 17:41:31 +00:00
Jan Wieck	cfeca62148	Background writer process This first part of the background writer does no syncing at all. It's only purpose is to keep the LRU heads clean so that regular backends seldom to never have to call write(). Jan	2003-11-19 15:55:08 +00:00
Jan Wieck	6b86d62b00	2nd try for the ARC strategy. I added a couple more Assertions while tracking down the exact cause of the former bug. All 93 regression tests pass now. Jan	2003-11-13 14:57:15 +00:00
Jan Wieck	923e994d79	ARC strategy backed out ... sorry Jan	2003-11-13 05:34:58 +00:00
Jan Wieck	48adc0b34b	Replacement of the buffer replacement strategy with an ARC algorithm adopted for PostgreSQL. Jan	2003-11-13 00:40:02 +00:00
Peter Eisentraut	feb4f44d29	Message editing: remove gratuitous variations in message wording, standardize terms, add some clarifications, fix some untranslatable attempts at dynamic message building.	2003-09-25 06:58:07 +00:00
Tom Lane	ffafacc1f6	Repair potential deadlock created by recent changes to recycle btree index pages: when _bt_getbuf asks the FSM for a free index page, it is possible (and, in some cases, even moderately likely) that the answer will be the same page that _bt_split is trying to split. _bt_getbuf already knew that the returned page might not be free, but it wasn't prepared for the possibility that even trying to lock the page could be problematic. Fix by doing a conditional rather than unconditional grab of the page lock.	2003-08-10 19:48:08 +00:00
Bruce Momjian	f3c3deb7d0	Update copyrights to 2003.	2003-08-04 02:40:20 +00:00
Bruce Momjian	089003fb46	pgindent run.	2003-08-04 00:43:34 +00:00
Tom Lane	cfa191f3b8	Error message editing in backend/storage.	2003-07-24 22:04:15 +00:00
Tom Lane	a4e775a263	Make use of new error context stack mechanism to allow random errors detected during buffer dump to be labeled with the buffer location. For example, if a page LSN is clobbered, we now produce something like ERROR: XLogFlush: request 2C000000/8468EC8 is not satisfied --- flushed only to 0/8468EF0 CONTEXT: writing block 0 of relation 428946/566240 whereas before there was no convenient way to find out which page had been trashed.	2003-05-10 19:04:30 +00:00
Tom Lane	fd42262836	Add code to apply some simple sanity checks to the header fields of a page when it's read in, per pghackers discussion around 17-Feb. Add a GUC variable zero_damaged_pages that causes the response to be a WARNING followed by zeroing the page, rather than the normal ERROR; this is per Hiroshi's suggestion that there needs to be a way to get at the data in the rest of the table.	2003-03-28 20:17:13 +00:00
Bruce Momjian	48ee6f4916	This trivial patch removes the usage of some old statistics code that no longer works -- IncrHeapAccessStat() didn't actually do anything anymore, so no reason to keep it around AFAICS. I also fixed a grammatical error in a comment. Neil Conway	2003-02-13 05:35:11 +00:00
Tom Lane	c91b8bc537	Cosmetic fixes from Neil Conway.	2002-09-14 19:59:20 +00:00
Bruce Momjian	e50f52a074	pgindent run.	2002-09-04 20:31:48 +00:00
Bruce Momjian	97ac103289	Remove sys/types.h in files that include postgres.h, and hence c.h, because c.h has sys/types.h.	2002-09-02 02:47:07 +00:00
Bruce Momjian	b1a5f87209	Tom Lane wrote: > There's no longer a separate call to heap_storage_create in that routine > --- the right place to make the test is now in the storage_create > boolean parameter being passed to heap_create. A simple change, but > it passeth patch's understanding ... Thanks. Attached is a patch against cvs tip as of 8:30 PM PST or so. Turned out that even after fixing the failed hunks, there was a new spot in bufmgr.c which needed to be fixed (related to temp relations; RelationUpdateNumberOfBlocks). But thankfully the regression test code caught it :-) Joe Conway	2002-08-15 16:36:08 +00:00
Tom Lane	e44beef712	Code review of CLUSTER patch. Clean up problems with relcache getting confused, toasted data getting lost, etc.	2002-08-11 21:17:35 +00:00
Tom Lane	5df307c778	Restructure local-buffer handling per recent pghackers discussion. The local buffer manager is no longer used for newly-created relations (unless they are TEMP); a new non-TEMP relation goes through the shared bufmgr and thus will participate normally in checkpoints. But TEMP relations use the local buffer manager throughout their lifespan. Also, operations in TEMP relations are not logged in WAL, thus improving performance. Since it's no longer necessary to fsync relations as they move out of the local buffers into shared buffers, quite a lot of smgr.c/md.c/fd.c code is no longer needed and has been removed: there's no concept of a dirty relation anymore in md.c/fd.c, and we never fsync anything but WAL. Still TODO: improve local buffer management algorithms so that it would be reasonable to increase NLocBuffer.	2002-08-06 02:36:35 +00:00
Bruce Momjian	8864603f3c	Minor code cleanup in bufmgr.c and bufmgr.h, mainly by moving repeated lines of code into internal routines (drop_relfilenode_buffers, release_buffer) and by hiding unused routines (PrintBufferDescs, PrintPinnedBufs) behind #ifdef NOT_USED. Remove AbortBufferIO() declaration from bufmgr.c (already declared in bufmgr.h) Manfred Koizar	2002-07-02 05:47:37 +00:00
Bruce Momjian	d84fe82230	Update copyright to 2002.	2002-06-20 20:29:54 +00:00
Bruce Momjian	6e8a1a6717	WriteBuffer return value: >I'd vote for changing WriteBuffer to >return void, and have it elog() on bad argument. Manfred Koizar	2002-06-15 19:59:59 +00:00
Bruce Momjian	918e864f14	Remove some pre-WAL relics: SharedBufferChanged BufferRelidLastDirtied BufferTagLastDirtied BufferDirtiedByMe Manfred Koizar	2002-06-15 19:55:38 +00:00
Bruce Momjian	171824087c	The patch I sent to -patches a little while ago wasn't applied: it was in the thread "make BufferGetBlockNumber() a macro". Tom objected to the original patch, so I prepared a new one which doesn't change BufferGetBlockNumber() into a macro, it just cleans up some comments and fixes an assertion. The patch is attached. Neil Conway	2002-04-15 23:47:12 +00:00
Bruce Momjian	92288a1cf9	Change made to elog: o Change all current CVS messages of NOTICE to WARNING. We were going to do this just before 7.3 beta but it has to be done now, as you will see below. o Change current INFO messages that should be controlled by client_min_messages to NOTICE. o Force remaining INFO messages, like from EXPLAIN, VACUUM VERBOSE, etc. to always go to the client. o Remove INFO from the client_min_messages options and add NOTICE. Seems we do need three non-ERROR elog levels to handle the various behaviors we need for these messages. Regression passed.	2002-03-06 06:10:59 +00:00
Bruce Momjian	a033daf566	Commit to match discussed elog() changes. Only update is that LOG is now just below FATAL in server_min_messages. Added more text to highlight ordering difference between it and client_min_messages. --------------------------------------------------------------------------- REALLYFATAL => PANIC STOP => PANIC New INFO level the prints to client by default New LOG level the prints to server log by default Cause VACUUM information to print only to the client NOTICE => INFO where purely information messages are sent DEBUG => LOG for purely server status messages DEBUG removed, kept as backward compatible DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1 added DebugLvl removed in favor of new DEBUG[1-5] symbols New server_min_messages GUC parameter with values: DEBUG[5-1], INFO, NOTICE, ERROR, LOG, FATAL, PANIC New client_min_messages GUC parameter with values: DEBUG[5-1], LOG, INFO, NOTICE, ERROR, FATAL, PANIC Server startup now logged with LOG instead of DEBUG Remove debug_level GUC parameter elog() numbers now start at 10 Add test to print error message if older elog() values are passed to elog() Bootstrap mode now has a -d that requires an argument, like postmaster	2002-03-02 21:39:36 +00:00
Tom Lane	f6ee99a062	Clean up usage-statistics display code (ShowUsage and friends). StatFp is gone, usage messages now go through elog(DEBUG).	2001-11-10 23:51:14 +00:00
Bruce Momjian	ea08e6cd55	New pgindent run with fixes suggested by Tom. Patch manually reviewed, initdb/regression tests pass.	2001-11-05 17:46:40 +00:00
Bruce Momjian	b81844b173	pgindent run on all C files. Java run to follow. initdb/regression tests pass.	2001-10-25 05:50:21 +00:00
Tom Lane	499abb0c0f	Implement new 'lightweight lock manager' that's intermediate between existing lock manager and spinlocks: it understands exclusive vs shared lock but has few other fancy features. Replace most uses of spinlocks with lightweight locks. All remaining uses of spinlocks have very short lock hold times (a few dozen instructions), so tweak spinlock backoff code to work efficiently given this assumption. All per my proposal on pghackers 26-Sep-01.	2001-09-29 04:02:27 +00:00
Tom Lane	55432fedd2	Implement LockBufferForCleanup(), which will allow concurrent VACUUM to wait until it's safe to remove tuples and compact free space in a shared buffer page. Miscellaneous small code cleanups in bufmgr, too.	2001-07-06 21:04:26 +00:00
Tom Lane	a29f6c095c	Make the found-a-buffer-when-we-were-expecting-to-extend-the-rel path actually work. It had been throwing an Assert as of my recent changes to bufmgr.c, but was not really right even before that AFAICT.	2001-07-02 18:47:18 +00:00
Tom Lane	af5ced9cfd	Further work on connecting the free space map (which is still just a stub) into the rest of the system. Adopt a cleaner approach to preventing deadlock in concurrent heap_updates: allow RelationGetBufferForTuple to select any page of the rel, and put the onus on it to lock both buffers in a consistent order. Remove no-longer-needed isExtend hack from API of ReleaseAndReadBuffer.	2001-06-29 21:08:25 +00:00

... 3 4 5 6 7 ...

563 Commits