Fix for the problem with
- Cross-outer-join dependency
- dead-end join prefix
- join order pruning
See the comments in the patch for a detailed description
The easiest way to compile and test the server with UBSAN is to run:
./BUILD/compile-pentium64-ubsan
and then run mysql-test-run.
After this commit, one should be able to run this without any UBSAN
warnings. There are still a few compiler warnings that should be fixed
at some point, but these do not expose any real bugs.
The 'special' cases where we disable, suppress or circumvent UBSAN are:
- ref10 source (as here we intentionally do some shifts that UBSAN
complains about).
- x86 version of optimized int#korr() methods. UBSAN does not like unaligned
memory access of integers. Fixed by using byte_order_generic.h when
compiling with UBSAN
- We use a smaller thread stack with ASAN and UBSAN, which forced me to
disable a few tests that print the thread stack size.
- Verifying class types does not work for shared libraries. I added
suppression in mysql-test-run.pl for this case.
- Added '#ifdef WITH_UBSAN' when using integer arithmetic where it is
safe to have overflows (two cases, in item_func.cc).
Things fixed:
- Don't left shift signed values (see the sketch at the end of this list)
(byte_order_generic.h, mysqltest.c, item_sum.cc and many more)
- Don't assign non-existing values to enum variables.
- Ensure that bool and enum values are properly initialized in
constructors. This was needed as UBSAN checks that these types have
correct values when one copies an object.
(gcalc_tools.h, ha_partition.cc, item_sum.cc, partition_element.h ...)
- Ensure we do not call handler functions on unallocated objects or
deleted objects.
(events.cc, sql_acl.cc).
- Fixed bugs in Item_sp::Item_sp() where we did not call the constructor
of the Query_arena object.
- Fixed several casts of objects to an incompatible class!
(Item.cc, Item_buff.cc, item_timefunc.cc, opt_subselect.cc, sql_acl.cc,
sql_select.cc ...)
- Ensure we do not do integer arithmetic that causes over or underflows.
This includes also ++ and -- of integers.
(Item_func.cc, Item_strfunc.cc, item_timefunc.cc, sql_base.cc ...)
- Added JSON_VALUE_UNINITIALIZED to json_value_types and ensure that
value_type is initialized to this instead of to -1, which is not a valid
enum value for json_value_types.
- Ensure we do not call memcpy() when the second argument could be null.
- Fixed that Item_func_str::make_empty_result() creates an empty string
instead of a null string (safer as it ensures we do not do arithmetic
on null strings).
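As an illustration of the signed-shift fixes listed above (a minimal sketch
with made-up values, not the actual code):

  int32 a= -1;
  /* int32 v= a << 8;  -- undefined behavior when 'a' is negative */
  int32 v= (int32) ((uint32) a << 8);  /* well-defined replacement */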
Other things:
- Changed struct st_position to a class and added an initialization
function to it to ensure that we do not copy or use uninitialized
members. The change to a class was also motivated by the fact that we
used "struct st_position" and POSITION randomly throughout the code,
which was confusing.
- Notably big rewrite in sql_acl.cc to avoid using deleted objects.
- Changed sql_partition to use '^' instead of '-'. This is safe as
the operand is either 0 or 0x8000000000000000ULL (see the sketch after
this list).
- Added check for select_nr < INT_MAX in JOIN::build_explain() to
avoid bug when get_select() could return NULL.
- Reordered elements in POSITION for better alignment.
- Changed sql_test.cc::print_plan() to use pointers instead of objects.
- Fixed bug in find_set() where we could execute '1 << -1'.
- Added variable have_sanitizer, used by mtr. (This variable previously
existed only in 10.5 and up.) It can now have one of two values:
ASAN or UBSAN.
- Moved ~Archive_share() from ha_archive.cc to ha_archive.h and marked
it virtual. This was an effort to get UBSAN to work with loaded storage
engines. I kept the change as the new place is better.
- Added in CONNECT engine COLBLK::SetName(), to get around a wrong cast
in tabutil.cpp.
- Added HAVE_REPLICATION around usage of rgi_slave, to get embedded
server to compile with UBSAN. (Patch from Marko).
- Added #ifdef for powerpc64 to avoid a bug in old gcc versions related
to integer arithmetic.
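As referenced from the sql_partition item above, a hedged illustration of
why '^' can replace '-' when the second operand is 0 or 0x8000000000000000ULL
(variable names are made up):

  ulonglong value= 42, mask= 0x8000000000000000ULL;  /* mask may also be 0 */
  ulonglong with_sub= value - mask;  /* wraps; flagged by unsigned-overflow
                                        sanitizer checks */
  ulonglong with_xor= value ^ mask;  /* flips only the top bit: the same
                                        result modulo 2^64, never overflows */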
Changes that should not be needed but had to be done to suppress warnings
from UBSAN:
- Added static_cast<uint16_t> around shifts to get rid of a LOT of
compiler warnings when using UBSAN.
- Had to change some divisions by powers of two into shifts to get rid of
some compile time warnings.
Reviewed by:
- Json changes: Alexey Botchkov
- Charset changes in ctype-uca.c: Alexander Barkov
- InnoDB changes & Embedded server: Marko Mäkelä
- sql_acl.cc changes: Vicențiu Ciorbaru
- build_explain() changes: Sergey Petrunia
SELECT_LEX objects that are "fake_select_lex" (i.e. read UNION output)
used both INT_MAX and UINT_MAX as select_number.
- mysql_explain_union() assigned UINT_MAX
- st_select_lex_unit::add_fake_select_lex assigned INT_MAX
This didn't matter initially (before EXPLAIN FORMAT=JSON), because the
code had no checks for this value.
EXPLAIN FORMAT=JSON and later other features did introduce checks for
select_number values. The check had to compare against two constants and
looked really confusing.
This patch joins the two constants into one - FAKE_SELECT_LEX_ID.
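A minimal sketch of the change (the exact value of the constant is an
assumption for illustration):

  /* Before: two magic numbers were assigned to select_number:
       mysql_explain_union():                     UINT_MAX
       st_select_lex_unit::add_fake_select_lex(): INT_MAX
     After: one named constant, used in both places and in all checks. */
  static const uint FAKE_SELECT_LEX_ID= UINT_MAX;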
At the second execution of the PS
1. mark_as_dependent() is called with the same parameters as at the first
execution (select#4 and select#3)
2. as outer_select (select#3) has been already merged at the first
execution of PS it cannot be reached using the outer_select() function
anymore (and so can not stop iteration).
3. as a result all selects towards the top level select including the
select for 'ca' are marked as uncacheable.
4. marked uncacheable, it was executed incorrectly, triggering filling of
its temporary table several times and use of freed memory at the end.
To avoid the problem we use name resolution context to go "up".
NOTE: the problem also exists in 10.2 but has no visible effect on
execution there. That is why the problem is fixed in 10.2 as well.
The patch also adds debug logging of important procedures and
better specifies the parameter types of st_select_lex::mark_as_dependent.
Adds an implementation for SELECT ... FOR UPDATE SKIP LOCKED /
SELECT ... LOCK IN SHARE MODE SKIP LOCKED
This is implemented only in InnoDB at the moment, not in RocksDB yet.
This adds a new handler flag HA_CAN_SKIP_LOCKED that
will be used when the storage engine advertises the flag.
When a storage engine indicates this flag it will get
TL_WRITE_SKIP_LOCKED and TL_READ_SKIP_LOCKED lock types.
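A hedged sketch of how the flag gates the lock types (simplified; the real
call sites differ):

  if (table->file->ha_table_flags() & HA_CAN_SKIP_LOCKED)
    lock_type= for_update ? TL_WRITE_SKIP_LOCKED : TL_READ_SKIP_LOCKED;
  else
    lock_type= for_update ? TL_WRITE : TL_READ_WITH_SHARED_LOCKS;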
The Lex structure has been updated to store both the FOR UPDATE/LOCK IN
SHARE as well as the SKIP LOCKED so the SHOW CREATE VIEW
implementation is simpler.
"SELECT FOR UPDATE ... SKIP LOCKED" combined with CREATE TABLE AS or
INSERT.. SELECT on the result set is not safe for STATEMENT based
replication. MIXED replication will replicate this as row based events.
Thanks to guidance from Facebook commit
193896c466
This helped verify the basic test case, and components that needed implementing
(even though every part was implemented differently).
Thanks Marko for guidance on a simpler InnoDB implementation.
Reviewers: Marko, Monty
Use < TL_FIRST_WRITE for determining a READ transaction.
Use TL_FIRST_WRITE as the relative operator replacing TL_WRITE_ALLOW_WRITE
as the minimum WRITE lock type.
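A minimal sketch of the new comparison (illustrative only):

  bool is_read_lock(enum thr_lock_type lock_type)
  {
    return lock_type < TL_FIRST_WRITE;  /* was: < TL_WRITE_ALLOW_WRITE */
  }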
splittable derived
If one of the joined tables of the processed query is a materialized derived
table (or view or CTE) with GROUP BY clause then under some conditions it
can be subject to split optimization. With this optimization new equalities
are injected into the WHERE condition of the SELECT that specifies this
derived table. The injected equalities are generated for all join orders
with which the split optimization can be employed. After the best join
order has been chosen, only some of these equalities are really needed.
The others can be safely removed. If this is not done and some of the
injected equalities involve expressions over semi-joins with look-up
access, then the query may return a wrong result set.
This patch effectively removes equalities injected for split optimization
that are needed only at the optimization stage and not needed for execution.
Approved by serg@mariadb.com
row_number() over () window function can be used without any column in the
OVER clause. Additionally, the item doesn't reference any tables, as it's
not effectively referencing any table. Rather, it is specifically built
based on the last temporary table used for window function computation.
This caused the remove_const function to wrongly drop it from the ORDER
list. Effectively, we shouldn't be dropping any window function from the
ORDER clause, so adjust remove_const to account for that.
Reviewed by: Sergei Petrunia sergey@mariadb.com
Attempt to execute an EXPLAIN statement on a multi-table DELETE statement
leads to firing of the assertion
DBUG_ASSERT(! is_set());
in the method Diagnostics_area::set_eof_status.
For example, above mentioned assertion failure happens
in case any of the following statements
EXPLAIN DELETE FROM t1.* USING t1
EXPLAIN DELETE b FROM t1 AS a JOIN t1 AS b
are executed in prepared statement mode provided the table t1
does exist.
This assertion is hit because the status of the
Diagnostics_area is set twice. The first time it is set from
the function do_select() when the method multi_delete::send_eof()
is called. The second time it is set when the method
Explain_query::send_explain() calls the method select_send::send_eof
(this method invokes the method Diagnostics_area::set_eof_status that
finally hits the assertion).
The second invocation of a setter method of the class Diagnostics_area
is correct and is run to send a response containing explain data.
But the first invocation of a setter method of the class Diagnostics_area
is wrong, since the function do_select() shouldn't be called at all
for handling of the EXPLAIN statement.
The reason the function do_select() is called during handling of
the EXPLAIN statement is that the flag SELECT_DESCRIBE is not set in the
data member JOIN::select_options. The flag SELECT_DESCRIBE
is copied from the value of select_lex->options.
During parsing of the EXPLAIN statement this flag is set, but later reset
by the function reinit_stmt_before_use() that is called on
execution of the prepared statement.
void reinit_stmt_before_use(THD *thd, LEX *lex)
{
  ...
  for (; sl; sl= sl->next_select_in_list())
  {
    if (sl->changed_elements & TOUCHED_SEL_COND)
    {
      /* remove option which was put by mysql_explain_union() */
      sl->options&= ~SELECT_DESCRIBE;
      ...
    }
    ...
  }
So, to fix the issue, the flag SELECT_DESCRIBE is set forcibly in the
mysql_select() function in case thd->lex->describe is set,
that is, in case EXPLAIN is being executed.
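A minimal sketch of the fix (simplified from the description above):

  /* In mysql_select(): force the flag back on when EXPLAIN is executed */
  if (thd->lex->describe)
    select_options|= SELECT_DESCRIBE;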
Fixes also:
MDEV-24942 Server crashes in _ma_rec_pack... with DEFAULT() on BLOB
This was caused by two different bugs, both related to that the default
value for the blob was not calculated before it was used:
- There were no Item_default_value::..result() wrappers, which are
needed as items in HAVING use these. This caused crashes when
using a reference to a DEFAULT(blob_field) in HAVING. It also
caused wrong results when used with other fields with default value
expressions that are not constants.
- create_tmp_field() did not take into account that blob fields with
default expressions are not yet initialized. Fixed by treating
Item_default_value(blob) like a normal item expression.
The failure happened for group by queries when all tables were marked as
'const tables' (tables with 0-1 matching rows) and no row matched the
where clause and there was in addition a direct reference to a field.
In this case the field would not be properly reset and the query would
return 'random data' that happened to be in table->record[0].
Fixed by marking all const tables as null tables in this particular case.
Sergei also provided an extra test case for the code.
@reviewer Sergei Petrunia <psergey@askmonty.org>
When creating a summary temporary table with bit fields used in the sum
expression with several parameters, like GROUP_CONCAT(), the counting of
bits needed in the record was wrong.
The reason we got an assert in Aria was that the bug caused a memory
overwrite in the record, and Aria noticed that the data was 'impossible'.
For an IN/ANY/ALL subquery without an aggregate function and HAVING clause,
the GROUP BY clause is removed.
Due to the GROUP BY list being removed, the invalid reference in the GROUP BY
clause was never resolved.
Remove the GROUP BY list only when all the items in the GROUP BY list
are resolved.
Also, removing the GROUP BY list later would not affect the extension that
allows using a non-aggregated field in an aggregate function (when
ONLY_FULL_GROUP_BY is not set), because the GROUP BY list is removed only
when there is NO aggregate function in the IN/ALL/ANY subquery.
The assertion failed in handler::ha_reset upon SELECT under
READ UNCOMMITTED from table with index on virtual column.
This was a debug-only failure, though the problem is much wider:
* MY_BITMAP is a structure containing my_bitmap_map, the latter is a raw
bitmap.
* read_set, write_set and vcol_set of TABLE are the pointers to MY_BITMAP
* The rest of MY_BITMAPs are stored in TABLE and TABLE_SHARE
* The pointers to the stored MY_BITMAPs, like orig_read_set etc., and
sometimes all_set and tmp_set, are assigned to those pointers.
* Sometimes tmp_use_all_columns is used to substitute the raw bitmap
directly with all_set.bitmap
* Sometimes even bitmaps are directly modified, like in
TABLE::update_virtual_field(): bitmap_clear_all(&tmp_set) is called.
The last three bullets in the list, when used together (which is mostly
always) make the program flow cumbersome and impossible to follow,
notwithstanding the errors they cause, like this MDEV-17556, where tmp_set
pointer was assigned to read_set, write_set and vcol_set, then its bitmap
was substituted with all_set.bitmap by dbug_tmp_use_all_columns() call,
and then bitmap_clear_all(&tmp_set) was applied to all this.
To untangle this knot, the following rule should be applied:
* Never substitute bitmaps! This patch is about this.
The orig_* and all_set bitmaps are already never substituted.
This patch changes the following function prototypes:
* tmp_use_all_columns, dbug_tmp_use_all_columns
to accept MY_BITMAP** and to return MY_BITMAP * instead of my_bitmap_map*
* tmp_restore_column_map, dbug_tmp_restore_column_maps to accept
MY_BITMAP* instead of my_bitmap_map*
These functions now will substitute read_set/write_set/vcol_set directly,
and won't touch underlying bitmaps.
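A sketch of the new prototypes (simplified; exact parameter lists are an
assumption):

  MY_BITMAP *tmp_use_all_columns(TABLE *table, MY_BITMAP **bitmap);
  void tmp_restore_column_map(MY_BITMAP **bitmap, MY_BITMAP *old);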
I_S tables were materialized too late; an attempt to use table
statistics before the table was created caused a crash.
Let's move table creation up. It only needs read_set to
be calculated properly, this happens in JOIN::optimize_inner(),
after semijoin transformation.
Note that tables are not populated at that point, so most of the
statistics would make no sense anyway. But at least field sizes
will be correct. And it won't crash.
- Create_tmp_table::finalize didn't clear 'file' after delete, which
could cause a double free. This is however not a likely problem, as
this code path is very unlikely to be hit.
- free_tmp_table() could do handler calls even if the table was never
opened. Fixed by adding a test if the table is opened.
This bug could cause a crash when executing queries that used mutually
recursive CTEs with system variable big_tables set to 1. It happened due
to several bugs in the code that handled recursive table references
referring to mutually recursive CTEs. For each recursive table reference a
temporary table is created that contains all rows generated for the
corresponding recursive CTE table on the previous step of recursion.
This temporary table should be created in the same way as the temporary
table created for a regular materialized derived table using the
method select_union::create_result_table(). In this case when the
temporary table is created it uses the select_union::TMP_TABLE_PARAM
structure as the parameter for the table construction. However the
code created the temporary table using just the function create_tmp_table()
and passed pointers to certain fields of the TMP_TABLE_PARAM structure
used for accumulation of rows of the recursive CTE table as parameters
for update. This was a mistake because now different temporary tables
cannot share some TMP_TABLE_PARAM fields in a general case. Besides,
depending on how mutually recursive CTE tables were defined and which
of them were referred in the executed query the select_union object
allocated for a recursive table reference could be allocated again after
the temporary table had been created. In this case the TMP_TABLE_PARAM
object associated with the temporary table created for the recursive
table reference contained unassigned fields needed for execution when
Aria engine is employed as the engine for temporary tables.
This patch ensures that
- select_union object is created only once for any recursive table
reference
- any temporary table created for recursive CTEs uses its own
TMP_TABLE_PARAM structure
The patch also fixes a problem caused by incomplete cleanup of join tables
associated with recursive table references.
Approved by Oleksandr Byelkin <sanja@mariadb.com>
A heuristic in best_access_path says that if, for an index,
ref access involves a number of key parts greater than or equal to the
number used by range access, then range access should not be considered.
The assumption made by this heuristic does not hold when
the range optimizer opted to use the group-by min-max optimization.
So the fix here would be to not consider the heuristic if
the range optimizer picked the usage of group-by min-max
optimization.
Due to a premature cleanup of the unit that specified a recursive CTE
used in the second operand of union the server fell into an infinite
loop in the reported test case. In other cases this premature cleanup
could cause other problems.
The bug is the result of a not quite correct fix for MDEV-17024. The
unit that specifies a recursive CTE has to be cleaned only after the
cleanup of the last external reference to this CTE. It means that
cleanups of the unit triggered not by the cleanup of an external
reference to the CTE must be blocked.
Usage of local table chains in selects to get external references to
recursive CTEs was not correct either because of possible merges of
some selects.
Also fixed a minor bug in st_select_lex::set_explain_type() that caused
typing 'RECURSIVE UNION' instead of 'UNION' in EXPLAIN output for external
references to a recursive CTE.
This follows up commit 94a520ddbe and
commit 7c5519c12d.
After these changes, the default test suites on a
cmake -DWITH_UBSAN=ON build no longer fail due to passing
null pointers as parameters that are declared to never be null,
but plenty of other runtime errors remain.
Reimplement MDEV-14275 Improving memory utilization for information schema
Postpone temp table instantiation until after setup_fields().
Replace all unused (not marked in read_set) columns in an I_S table
with CHAR(0). This can drastically reduce the footprint of a MEMORY
table (a TABLE_CATALOG alone is 1538 bytes per row).
This does not change the engine. If the table was decided to be Aria
(because of, say, blobs) then after optimization it'll stay Aria
even if all blobs were removed.
Note 1: when transforming table structure, share->blob_fields is
preserved, otherwise Aria might switch from DYNAMIC to STATIC row format
and expect a special field for a deleted mark, which create_tmp_table
didn't provide.
Note 2: optimizer was doing handler::info() (to know the number of rows)
before the temp table is populated. That didn't make much sense. Now
it's done before the table is even instantiated. Preserve the old
behavior and report 0 rows.
This reverts e2664ee836 and a8458a2345
PARTITION clause in SELECT means query is non-versioned (see
WITH_PARTITION_STORAGE_ENGINE in vers_setup_conds()).
vers_setup_conds() expands such query to SYSTEM_TIME_ALL which is then
added to VIEW specification. When VIEW is queried both clauses
PARTITION and FOR SYSTEM_TIME ALL lead to ER_VERS_QUERY_IN_PARTITION
(same place WITH_PARTITION_STORAGE_ENGINE).
Fix removes FOR SYSTEM_TIME ALL from VIEW by accessing original
SYSTEM_TIME clause: the one specified in parser. As a side-effect
EXPLAIN SELECT displays SYSTEM_TIME specified in SELECT which is
user-friendly.
For join to work correctly versioning condition must be added to table
on_expr. Without that JOIN_CACHE gets expression (1)
trigcond(xtitle.row_end = TIMESTAMP'2038-01-19 06:14:07.999999') and
trigcond(xtitle.elementId = x.`id` and xtitle.pkey = 'title')
instead of (2)
trigcond(xtitle.elementId = x.`id` and xtitle.pkey = 'title')
for join_null_complements(). It is the NULL-row of xtitle for
complementing the join, and the above comparisons are of course FALSE, but
trigcond (Item_func_trig_cond) makes them TRUE via its trig_var
property, which is bound to some boolean properties of JOIN_TAB.
Expression (2) evaluated to TRUE because its trig_var is bound to
first_inner_tab->not_null_compl. The expression (1) does not evaluate
correctly because row_end comparison's trig_var is bound to
first_inner->found earlier. As a result JOIN_CACHE::check_match()
skipped the row for join_null_complements().
When we add versioning condition to table's on_expr the optimizer in
make_join_select() distributes conditions differently. tmp_cond
inherits on_expr value and in Good case it is full expression
xgender.elementId = x.`id` and xgender.pkey = 'gender' and
xgender.row_end = TIMESTAMP'2038-01-19 06:14:07.999999'
while in the Bad case it is only
xgender.elementId = x.`id` and xgender.pkey = 'gender'.
Later, in the Good case, the row_end condition is optimized out and we get
one trigcond in the form of (2).
Diagnostics_area::set_error_status
Analysis: When strict mode is enabled, all warnings are converted to errors
including those which do not occur because of bad data.
Fix: The query should not be aborted when we have a warning because the
limit on examined rows was reached, since that warning does not happen due
to bad data. So thd->abort_on_warning should be false.
The issue here was that the query was using the ORDER BY LIMIT optimization,
where the access method was changed from EQ_REF access to an index scan (an
index that would resolve the ORDER BY clause).
But the parameter READ_RECORD::unlock_row was not reset to rr_unlock_row, which is
used when the access method is not EQ_REF access.
Problem:
Queries like this showed performance degradation in 10.4 over 10.3:
SELECT temporal_literal FROM t1;
SELECT temporal_literal + 1 FROM t1;
SELECT COUNT(*) FROM t1 WHERE temporal_column = temporal_literal;
SELECT COUNT(*) FROM t1 WHERE temporal_column = string_literal;
Fix:
Replacing the universal member "MYSQL_TIME cached_time" in
Item_temporal_literal to data type specific containers:
- Date in Item_date_literal
- Time in Item_time_literal
- Datetime in Item_datetime_literal
This restores the performance, and makes it even better in some cases.
See benchmark results in MDEV.
Also, this change furthers the separation of Date, Time and Datetime
from each other, which will make it possible not to derive them from
the too heavy (40 bytes) MYSQL_TIME and to replace them with smaller
data type specific containers.
The issue here is when records are read from the temporary file
(filesort result in this case) via a cache (rr_from_cache).
The cache is initialized with init_rr_cache.
For correlated subquery the cache allocation is happening at each execution
of the subquery, but the deallocation happens only once, when the query
execution is done.
So generally for subqueries we do two types of cleanup:
1) Full cleanup: we should free all resources of the query (like temp tables).
This is done generally when the query execution is complete or the subquery
re-execution is not needed (case with uncorrelated subquery)
2) Partial cleanup: Minor cleanup that is required if
the subquery needs recalculation. This is done for all the structures that
need to be allocated for each execution (example SORT_INFO for filesort
is allocated for each execution of the correlated subquery).
The fix here is to free the cache used by rr_from_cache in the partial
cleanup phase.
* Fix the crash: IN-to-EXISTS rewrite causes an error (and so
JOIN::optimize() fails with an error, too), don't call
update_used_tables(). Terminate the query execution instead.
* Fix the cause of the error in the IN-to-EXISTS rewrite: don't do
the rewrite if doing it will cause an error of this kind:
This version of MariaDB doesn't yet support 'SUBQUERY in ROW in left
expression of IN/ALL/ANY'
* Fix another issue exposed by this testcase:
JOIN::setup_subquery_caches() may be invoked before any select has
saved its query plan, and will crash because none of the SELECTs
has called create_explain_query_if_not_exists() to create the Explain
Data Structure for this SELECT.
TODO: When merging this to 10.2, remove the poorly-placed call to
create_explain_query_if_not_exists made by the fix for MDEV-16153.
For the case when the SJM scan table is the first table in the join order,
if we want to do the sorting on the SJM scan table, then we need to
make sure that we unpack the values to base table fields in two cases:
1) Reading the SJM table and writing the sort-keys inside the sort-buffer
2) Reading the sorted data from the sort file
first step in moving drop table out of the handler.
todo: other methods that don't need an open table
for now hton->drop_table is optional, for backward compatibility
reasons
- Some of the bug fixes are backports from 10.5!
- The fix in innobase/fil/fil0fil.cc is just a backport to get less
error messages in mysqld.1.err when running with valgrind.
- Renamed HAVE_valgrind_or_MSAN to HAVE_valgrind
Fixed by:
- Make all quick_* variables allocated according to the real number of
keys instead of MAX_KEY
- Store all the quick* items in a separate allocated structure (OPT_RANGE)
- Ensure we don't access any quick* variable without first checking
opt_range_keys.is_set(). Thanks to this, we don't need any
pre-initialization of quick* variables anymore.
Some renames were done to use the new structure:
table->quick_keys -> table->opt_range_keys
table->quick_rows[X] -> table->opt_range[X].rows
table->quick_key_parts[X] -> table->opt_range[X].key_parts
table->quick_costs[X] -> table->opt_range[X].cost
table->quick_index_only_costs[X] -> table->opt_range[X].index_only_cost
table->quick_n_ranges[X] -> table->opt_range[X].ranges
table->quick_condition_rows -> table->opt_range_condition_rows
This patch should both decrease memory needed for TABLE objects
(3528 -> 984 + keyinfo) and increase performance, thanks to less
initializations per query, and more localized memory, thanks to the
opt_range structure.
- Removed not needed bzero in void TABLE::initialize_quick_structures().
- Replaced bzero with TRASH_ALLOC() to have this change verified with
memory checkers
- Added missing table->quick_keys.is_set in table_cond_selectivity()
- select_describe() should not attempt to produce query plans
for subqueries if the query is handled by a Select Handler.
- JOIN::save_explain_data_intern should not add links to Explain_select
for children selects if:
1. The whole query is handled by the Select Handler, or
2. this select (and so its children) is handled by Derived Handler.
Starting from 10.3, the optimizer is able to detect that entire outer join
nests are constants (because of "Impossible ON") and remove them (see
mark_join_nest_as_const)
However, this was not properly accounted for in NESTED_JOIN structure
and the way check_interleaving_with_nj() uses its n_tables member to
check if the join prefix order is allowed.
(The result was that the optimizer could conclude that no join prefix is
allowed and fail an assertion)
The reason for the change was to make it easier to find true errors
when searching in trace logs.
"error:" should mainly be used when we have a real error
When a prepared statement parameter '?' is used in a CTE that is used
multiple times, the following happens:
- The CTE definition is re-parsed multiple times.
- There are multiple Item_param objects referring to the same "?" in
the original query.
- Prepared_statement::param has a pointer to the first of them, the
others are "clones".
- When prepared statement parameter gets the value, it should be passed
over to clones with param->sync_clones() call.
This call is made in insert_params(), etc. It was not made in
insert_params_with_log().
This would cause Item_param to not have any value which would confuse
the query optimizer.
Added the missing call.
In case of SELECT without tables which returns either 0 or 1 rows,
JOIN::exec_inner() did not check if the flag representing SQL_CALC_FOUND_ROWS
is set or not, and send_records was directly assigned 0. So SELECT FOUND_ROWS()
was giving 0 in the output. Now it checks if the flag is set: if it is set,
send_records=1, else 0. 1 is the number of rows that could have been sent
to the client if the SELECT query had SQL_CALC_FOUND_ROWS.
It is 0 when no rows were sent because the SELECT query did not have
SQL_CALC_FOUND_ROWS.
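A minimal sketch of the fixed logic in JOIN::exec_inner() (using the
server's existing OPTION_FOUND_ROWS flag; surrounding code assumed):

  /* 1 if SQL_CALC_FOUND_ROWS was given, else 0 */
  send_records= (select_options & OPTION_FOUND_ROWS) ? 1 : 0;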
UPDATE gets access to history records because versioning conditions
are not set for VIEW. This leads to endless loop of inserting history
records when clustered index is rebuilt and ha_rnd_next() returns
newly inserted history record.
Return back original behavior of failing on write-locked table in
historical query.
35b679b9 assumed that SELECT_LEX::lock_type influences anything, but
actually at this point the table is already locked. The original bug
report was tempesta-tech/mariadb#102
A temporary table is needed for window function computation, but if only a
NAMED WINDOW SPEC is used and there is no window function, then there is no
need to create a temporary table, as there is no stage that computes a
WINDOW FUNCTION.
1. Code simplification:
Item_default_value handled all these values:
a. DEFAULT(field)
b. DEFAULT
c. IGNORE
and had various conditions to distinguish (a) from (b) and from (c).
Introducing a new abstract class Item_contextually_typed_value_specification,
to handle (b) and (c), so the hierarchy now looks as follows:
Item
  Item_result_field
    Item_ident
      Item_field
        Item_default_value                          - DEFAULT(field)
  Item_contextually_typed_value_specification
    Item_default_specification                      - DEFAULT
    Item_ignore_specification                       - IGNORE
2. Introducing a new virtual method is_evaluable_expression() to
determine if an Item is:
- a normal expression, so its val_xxx()/get_date() methods can be called
- or just an expression substitute, whose value methods cannot be called.
3. Disallowing Items that are not evaluable expressions in table value
constructors.
For a unique key, if all the key parts are NOT NULL or the predicates
involving the key parts are NULL rejecting, then we can use EQ_REF access
instead of ref access with the unique key.
The reason for this is to make all temporary file names similar and
also to be able to figure out from where a #sql-xxx name originates.
New format is for most cases:
'#sql-name-current_pid-thread_id[-increment]'
Where name is one of subselect, alter, exchange, temptable or backup
The exceptions are:
ALTER PARTITION shadow files:
'#sql-shadow-thread_id-original_table_name'
Names used with temp pool:
'#sql-name-current_pid-pool_number'
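A hypothetical illustration of building the common format (not the actual
code; buffer and variables are made up):

  char buf[FN_REFLEN];
  my_snprintf(buf, sizeof(buf), "#sql-%s-%lu-%llu",
              name, (ulong) current_pid, (ulonglong) thread_id);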
- Print the rowid filters that are available for use with each table.
- Make print_best_access_for_table() print which filter it has picked.
- Make best_access_path() print the filter for considered ref accesses.
* rename to a generic name
* move remaining initializations from query exec to prepare time
* simplify/unify key handling in open_table_from_share and delayed
* remove dead code
* move tests where they belong
- multi_range_read_info_const now uses the new records_in_range interface
- Added handler::avg_io_cost()
- Don't calculate avg_io_cost() in get_sweep_read_cost if avg_io_cost is
not 1.0. In this case we trust the avg_io_cost() from the handler.
- Changed test_quick_select to use TIME_FOR_COMPARE instead of
TIME_FOR_COMPARE_IDX to align this with the rest of the code.
- Fixed bug when using test_if_cheaper_ordering where we didn't use
keyread if index was changed
- Fixed a bug where we didn't use index only read when using order-by-index
- Added keyread_time() to HEAP.
The default keyread_time() was optimized for blocks and not suitable for
HEAP. The effect was that HEAP preferred table scans over ranges for btree
indexes.
- Fixed get_sweep_read_cost() for HEAP tables
- Ensure that range and ref have same cost for simple ranges
Added a small cost (MULTI_RANGE_READ_SETUP_COST) to ranges to ensure
we favor ref over range for simple queries.
- Fixed that matching_candidates_in_table() uses same number of records
as the rest of the optimizer
- Added avg_io_cost() to JT_EQ_REF cost. This helps calculate the cost for
HEAP and temporary tables better. A few tests changed because of this.
- heap::read_time() and heap::keyread_time() adjusted to not add +1.
This was to ensure that handler::keyread_time() doesn't give
higher cost for heap tables than for normal tables. One effect of
this is that heap and derived tables stored in heap will prefer
key access as this is now regarded as cheap.
- Changed cost for index read in sql_select.cc to match
multi_range_read_info_const(). All index cost calculation is now
done through one function.
- 'ref' will now use quick_cost for keys if it exists. This is done
so that for '=' ranges, 'ref' is preferred over 'range'.
- scan_time() now takes avg_io_costs() into account
- get_delayed_table_estimates() uses block_size and avg_io_cost()
- Removed default argument to test_if_order_by_key(); simplifies code
This was done to both simplify the code and also to be easier to handle
storage engines that are clustered on some other index than the primary
key.
As pk_is_clustering_key() and is_clustering_key() now use only
index_flags, these were removed from all storage engines.
MDEV-21606 Improve update handler (long unique keys on blobs)
MDEV-21470 MyISAM and Aria start_bulk_insert doesn't work with long unique
MDEV-21606 Bug fix for previous version of this code
MDEV-21819 2 Assertion `inited == NONE || update_handler != this'
- Move update_handler from TABLE to handler
- Move out initialization of update handler from ha_write_row() to
prepare_for_insert()
- Fixed that INSERT DELAYED works with update handler
- Give an error if using long unique with an autoincrement column
- Added handler function to check if table has long unique hash indexes
- Disable write cache in MyISAM and Aria when using update_handler as
if cache is used, the row will not be inserted until end of statement
and update_handler would not find conflicting rows.
- Removed not used handler argument from
check_duplicate_long_entries_update()
- Syntax cleanups
- Indentation fixes
- Don't use single character identifiers for arguments
e.g.
- dont -> don't
- occurence -> occurrence
- succesfully -> successfully
- easyly -> easily
Also remove trailing space in selected files.
These changes span:
- server core
- Connect and Innobase storage engine code
- OQgraph, Sphinx and TokuDB storage engines
Related to MDEV-21769.
Create_tmp_table::add_fields(): Initialize uneven_delta= 0
to suppress the warning from GCC 9.2.1 and 10.0.1,
and consistently indent the code with spaces.
It was:
implicit conversion from 'ha_rows' (aka 'unsigned long long') to 'double'
changes value from 18446744073709551615 to 18446744073709551616
Follow what JOIN::get_examined_rows() does for similar code.
This task deals with packing the sort key inside the sort buffer, which would
lead to efficient usage of the memory allocated for the sort buffer.
The changes brought by this feature are
1) Sort buffers would have sort keys of variable length
2) The format for sort keys inside the sort buffer would look like
|<sort_length><null_byte><key_part1><null_byte><key_part2>.......|
sort_length is the extra bytes that are required to store the variable
length of a sort key.
3) When packing of sort key is done we store the ORIGINAL VALUES inside
the sort buffer and not the STRXFRM form (mem-comparable sort keys).
4) Special comparison function packed_keys_comparison() is introduced
to compare 2 sort keys.
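A hedged sketch of walking one packed key, following the layout above (the
byte width of sort_length and the helper key_part_length() are assumptions
for illustration):

  uchar *p= key;
  uint sort_length= uint4korr(p);        /* bytes used by this key */
  p+= 4;
  for (uint i= 0; i < num_key_parts; i++)
  {
    bool is_null= (*p++ == 0);           /* null indicator byte */
    if (!is_null)
      p+= key_part_length(i, p);         /* original value, not strxfrm'ed */
  }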
This patch also contains contributions from Sergei Petrunia.
- Added unlikely() to optimize for not having optimizer trace enabled
- Made THD::trace_started() inline
- Added 'if (trace_enabled())' around some potentially expensive code
(not many found)
- Added ASSERT's to ensure we don't call expensive optimizer trace calls
if optimizer trace is not enabled
- Added length to Json_writer functions to speed up buffer writes
when optimizer trace is enabled.
- Changed LEX_CSTRING argument handling to not send the full struct to the
writer function; on_add_str() functions now trust length arguments
[Variant 2 of the fix: collect the attached conditions]
Problem:
make_join_select() has a section of code which starts with
"We plan to scan all rows. Check again if we should use an index."
the code in that section will [unnecessarily] re-run the range
optimizer using this condition:
condition_attached_to_current_table AND current_table's_ON_expr
Note that the original invocation of range optimizer in
make_join_statistics was done using the whole select's WHERE condition.
Taking the whole select's WHERE condition and using multiple-equalities
allowed the range optimizer to infer more range restrictions.
The fix:
- Do range optimization using a condition that is an AND of this table's
condition and all of the previous tables' conditions.
- Also, fix the range optimizer to prefer SEL_ARGs with type=KEY_RANGE
over SEL_ARGS with type=MAYBE_KEY, regardless of the key part.
Computing
  key_and(SEL_ARG(type=MAYBE_KEY, key_part=1),
          SEL_ARG(type=KEY_RANGE, key_part=2))
will now produce the SEL_ARG with type=KEY_RANGE.
This task deals with packing the non-sorted fields (or addon fields).
This would lead to efficient usage of the memory allocated for the sort buffer.
The changes brought by this feature are
1) Sort buffers would have records of variable length
2) Each record in the sort buffer would be stored like
<sort_key1><sort_key2>....<addon_length><null_bytes><field1><field2>....
addon_length is the extra bytes that are required to store the variable
length of addon field across different records.
3) Changes in rr_unpack_from_buffer and rr_from_tempfile to take into account
the variable length of records.
Ported WL#1509 Pack values of non-sorted fields in the sort buffer from
MySQL by Tor Didriksen
row_search_idx_cond_check with rowid_filter upon concurrent access to table
This bug has nothing to do with concurrent access to the table. Rather it
concerns queries for which the optimizer decides to employ a rowid filter
when accessing an InnoDB table by a secondary index, but later, when
calling test_if_skip_sort_order(), changes its mind to access the table by
the primary key.
Currently usage of rowid filters is not supported in InnoDB if the table
is accessed by the primary key. So in this case usage of a rowid filter
to access the table must be prohibited.
In this scenario:
- There is a possible range access for table T
- And there is a ref access on the same index which uses fewer key parts
- The join optimizer picks the ref access (because it is cheaper)
- make_join_select applies this heuristic to switch to range:
/* Range uses longer key; Use this instead of ref on key */
Join buffer will be used without having called
JOIN_TAB::make_scan_filter(). This means, conditions that should be
checked when reading table T will be checked after T is joined with the
contents of the join buffer, instead.
Fixed this by adding a make_scan_filter() check.
(updated patch after backport to 10.3)
(Fix testcase on Windows)
The issue here is that for degenerate joins we should execute the window
function, but it is not getting executed in all cases.
To get the window function values, the window function needs to be executed
always. This currently does not happen in a few cases
where the join would return 0 or 1 row, like
1) IMPOSSIBLE WHERE
2) MIN/MAX optimization
3) EMPTY CONST TABLE
The fix is to make sure that window functions get executed
and the temporary table is set up for the execution of window functions.
The query requires 2 temporary tables for execution; the window function
is always attached to the last temporary table, but in this case the
result field of the window function points to the first temporary table
rather than the last one.
Fixed this by not changing window function items with temporary table
items of the first temporary table.
Don't do skip_setup_conds() unless all errors are checked.
Fixes following errors:
ER_PERIOD_NOT_FOUND
ER_VERS_QUERY_IN_PARTITION
ER_VERS_ENGINE_UNSUPPORTED
ER_VERS_NOT_VERSIONED
in row_search_idx_cond_check
When usage of rowid filter is evaluated by the optimizer to join a table
to the current partial join employing a certain index it should be checked
that a key for at least the major component of this index can be constructed
using values from the columns of the partial join.
in row_search_idx_cond_check
For a single table query with ORDER BY and several sargable range
conditions the optimizer may choose an execution plan that employs
a rowid filter. In this case it is important to build the filter before
calling the function JOIN_TAB::sort_table() that creates the sort index
for the result set, because when this index is created the filter has
to be already filled. After the sort index has been created the
filter must be deactivated. If this is not done, the InnoDB function
row_search_idx_cond_check() gets confused when it has to read rows
from the created sort index by using ha_rnd_pos().
The order of actions mentioned above is needed also when processing a
join query if sorting is performed for the first non-constant table in
the chosen execution plan.
MDEV-18957 UPDATE with LIMIT clause is wrong for versioned partitioned tables
UPDATE, DELETE: replace linear search of current/historical records
with vers_setup_conds().
Additional DML cases in view.test
Count the "gap" time between table accesses and display it as
r_other_time_ms in the "table" element.
* The advantage of this approach is that it doesn't add any new
my_timer_cycles() calls.
* The disadvantage is that the definition of what is done during
"other time" is not that clear: it includes checking the WHERE
(for this table), constructing the index lookup tuple (for the next table),
writing to the GROUP BY temporary table (as we don't account for that time
separately [yet]), etc.
The issue here is the wrong estimate of the cardinality of a partial join:
the cardinality is too high because the function table_cond_selectivity()
returns an absurd number 100 while selectivity cannot be greater than 1.
When accessing table t by outer reference t1.a via index we do not perform
any range analysis for t. Yet we see that TABLE::quick_key_parts[key] and
TABLE::quick_rows[key] contain a non-zero value though these should have
remained untouched and equal to 0.
Thus the real cause of the problem is that TABLE::init does not clean the
arrays TABLE::quick_key_parts[] and TABLE::quick_rows[].
It should have done it because the TABLE structure created for any
instance of a table can be reused for many queries.
A conflict between MDEV-19514 (b42294bc64)
and MDEV-20934 (d7a2401750)
was resolved. We will not invoke the function ibuf_delete_recs()
from ibuf_merge_or_delete_for_page(). Instead, we will add that
logic to the function ibuf_read_merge_pages().
In the function prev_record_reads, where one finds the number of different
row combinations for a subset of a partial join, the selectivity of the
tables involved in that subset was not taken into account.
because internally setup_wild() adjusts select_lex->with_wild directly
anyway, so there is no reason to pretend that the number of '*' may be
anything else but select_lex->with_wild
And don't update select_lex->item_list, because fields can come
from anywhere and don't necessarily have to be copied into select_lex.
Now both offset and limit are stored and are not changed during execution
(offset is decreased during processing in versions before 10.5).
(Big part of this changes made by Monty)
These two methods:
- Item_result_field::create_tmp_field_ex()
- Item_func_user_var::create_tmp_field_ex()
had duplicate code, except that they used a different type handler.
Adding a protected method Item_result_field::create_tmp_field_ex_from_handler()
with a "const Type_handler*" parameter, and reusing it from the
two mentioned methods.
1. Removed TIMESTAMP/TRANSACTION unit auto-detection in favor of default TIMESTAMP.
Reasons:
1.1. rare practical use and doubtful advantage of such auto-detection;
1.2. it conflicts with MDEV-16226 (TRX_ID-based versioned tables performance improvement).
Needless check_unit membership removed.
2. SQL: versioning type handling refactoring
Vers_type_handler hierarchy stores versioning properties of type.
virtual Type_handler::vers() accesses specialization of
Vers_type_handler for specific type.
virtual Vers_type_handler::kind() returns versioning kind
(timestamp/trx_id).
Removed Type_handler::Vers_history_point_check_unit() in favor of
Type_handler::vers().
Renames:
require_timestamp() -> require_timestamp_error()
require_trx_id() -> require_trx_id_error()
EDIT by Alexander Barkov (@abarkov):
check_sys_fields() moved to Vers_type_handler::check_sys_fields()
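A hedged sketch of the dispatch described above (simplified; exact
signatures are assumptions):

  const Vers_type_handler *vers= field->type_handler()->vers();
  if (vers && vers->kind() == VERS_TIMESTAMP)
    ...   /* TIMESTAMP-based versioning */
  else
    ...   /* TRX_ID-based versioning */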
selectivity values fails
After having set the assertion that checks validity of selectivity values
returned by the function table_cond_selectivity() a test case from
order_by.test failed. The failure occurred because the range optimizer could
return as an estimate of the cardinality of the ranges built for an index
a number exceeding the total number of records in the table.
The second bug is more subtle. It may happen when there are several
indexes with same prefix defined on the first joined table t accessed by
a constant ref access. In this case the range optimizer estimates the
number of accessed records of t for each usable index and these
estimates can be different. Only the first of these estimates is taken
into account when the selectivity of the ref access is calculated.
However the optimizer later can choose a different index that provides
a different estimate. The function table_cond_selectivity() could use
this estimate to discount the selectivity of the ref access. This could
lead to a selectivity value returned by this function that was greater
than 1.
best_access_path() is called from two optimization phases:
1. Plan choice phase, in choose_plan(). Here, the join prefix being
considered is in join->positions[]
2. Plan refinement stage, in fix_semijoin_strategies_for_picked_join_order
Here, the join prefix is in join->best_positions[]
It used to access join->positions[] from stage #2. This didn't cause any
valgrind or asan failures (as join->positions[] had been written to before)
but the effect was similar to that of reading random data:
the join prefix we've picked (in join->best_positions) could have
nothing in common with the join prefix that was last to be considered
(in join->positions).
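A minimal sketch of the shape of the fix (the extra parameter is an
assumption): the caller now passes the prefix array it actually works with,
instead of best_access_path() always reading join->positions[]:

  /* stage #1 (choose_plan):  prefix lives in join->positions
     stage #2 (fix_semijoin_strategies_for_picked_join_order):
                              prefix lives in join->best_positions */
  void best_access_path(JOIN *join, ..., const POSITION *join_positions,
                        ...);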
This patch introduces the optimization that allows range optimizer to
consider index range scans that are built employing NOT NULL predicates
inferred from WHERE conditions and ON expressions.
The patch adds a new optimizer switch not_null_range_scan.
(Backported to 10.3, addressed review input)
Sj_materialization_picker::check_qep(): fix error in cost/fanout
calculations:
- for each join prefix, add #prefix_rows / TIME_FOR_COMPARE to the cost,
like best_extension_by_limited_search does
- Remove the fanout produced by the subquery tables.
- Also take into account join condition selectivity
optimize_wo_join_buffering() (used by LooseScan and FirstMatch)
- also add #prefix_rows / TIME_FOR_COMPARE to the cost of each prefix.
- Also take into account join condition selectivity
handlerton that is able to process the whole query. For that it
traverses tables from subqueries.
Select_handler now cleans up temporary table structures on destructor call.
The MDEV-20265 commit e746f451d5
introduces DBUG_ASSERT(right_op == r_tbl) in
st_select_lex::add_cross_joined_table(), and that assertion would
fail in several tests that exercise joins. That commit was skipped
in this merge, and a separate fix of MDEV-20265 will be necessary in 10.4.
Current easy fix is not possible, because SELECT clones ha_partition
and then closes the clone which leads to unclosed transaction in
partitions we forcibly prune out. We could solve this by closing these
partitions (and release from transaction) in
change_partitions_to_open() at versioning conditions stage, but this
is problematic because table lock is acquired for each partition at
open stage and therefore must be released when we close partition
handler in change_partitions_to_open(). More details in MDEV-20376.
This should change after MDEV-20250 where mechanism of opening
partitions will be improved.
This reverts commit cdbac54df0.
For MDEV-15955, the fix in create_tmp_field_from_item() would cause a
compilation error. After a discussion with Alexander Barkov, the fix
was omitted and only the test case was kept.
In 10.3 and later, MDEV-15955 is fixed properly by overriding
create_tmp_field() in Item_func_user_var.
When discounting selectivity of ref access, don't discount the
selectivity we've already discounted for range access.
The 10.1 version of the fix. Will need to adjust condition filtering
test results in 10.4
Exclude SELECT and INSERT SELECT from vers_set_hist_part(). We cannot
likewise exclude REPLACE SELECT because it may REPLACE into itself
(and REPLACE generates history).
INSERT also does not generate history, but we have the history
modification setting, which might be interfered with.
The bug occurred when the optimizer decided to use a rowid filter built
by a range index scan to access an InnoDB table with generated clustered
index.
When a table is accessed by a secondary index Idx employing a rowid filter,
the value of pk contained in the found index tuple is checked against the
filter. A call of the handler function position() is supposed to put the
pk value into the handler::ref buffer. However for generated clustered
primary keys it did not happen. The patch fixes this problem.
cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Debug
Maintainer mode makes all warnings errors. This patch fixes warnings,
mostly about the deprecated `register` keyword.
Too many warnings came from Mroonga, so I gave up on it.
and WHERE filter afterwards
This patch complements the patch fixing the bug MDEV-6892. The latter
properly handled queries that used mergeable views returning constant
columns as inner tables of outer joins and whose where clause contained
predicates referring to these columns if the predicates of happened not
to be equality predicates. Otherwise the server still could return wrong
result sets for such queries. Besides the fix for MDEV-6892 prevented
some possible conversions of outer joins to inner joins for such queries.
This patch corrected the function check_simple_equality() to handle
properly conjunctive equalities of the where clause that refer to the
constant columns of mergeable views used as inner tables of an outer join.
The patch also changed the code of Item_direct_view_ref::not_null_tables().
This change made it possible to take into account predicates containing references
to constant columns of mergeable views when converting outer joins into
inner joins.
Removed Field_map, since it was used only in a single function.
Fixed is_indexed_agg_distinct(), since it relied on initialization of
Bitmap in constructor.