The purpose of the semi_join
operator is, given an arrayOp RefArray
, to return an arrayOp with the same schema as RefArray
but with content filtered to those records matching a supplied R data from or second arrayOp instance on some subset of its dimensions and/or attributes.
We first create a RefArray
with content from ArrayContent
data frame:
ArrayContent = data.frame(
da=1:20, db=101:120,
lower = letters[1:20],
upper = LETTERS[1:20],
f_int32 = -20:-1,
f_int64 = 1:20 * 10.0,
f_bool = c(T,NA,F,NA,F),
f_double = c(3.14, 2.0, NA, 0, -99)
)
RefArray = conn$
array_from_df(ArrayContent,
template = "<lower:string COMPRESSION 'zlib', upper:string, f_int32:int32, f_int64:int64, f_bool: bool, f_double: double>
[da=0:*:0:*; db=0:*:0:*]",
force_template_schema = T)$ # Ensure RefArray has the exact schema as the template
persist(.gc = FALSE)
RefArray$to_afl()
#> [1] "Rarrayop_37e37fd1a2bb"
RefArray$to_df_all()
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
#> 6 6 106 f F -15 60 TRUE 3.14
#> 7 7 107 g G -14 70 NA 2.00
#> 8 8 108 h H -13 80 FALSE NA
#> 9 9 109 i I -12 90 NA 0.00
#> 10 10 110 j J -11 100 FALSE -99.00
#> 11 11 111 k K -10 110 TRUE 3.14
#> 12 12 112 l L -9 120 NA 2.00
#> 13 13 113 m M -8 130 FALSE NA
#> 14 14 114 n N -7 140 NA 0.00
#> 15 15 115 o O -6 150 FALSE -99.00
#> 16 16 116 p P -5 160 TRUE 3.14
#> 17 17 117 q Q -4 170 NA 2.00
#> 18 18 118 r R -3 180 FALSE NA
#> 19 19 119 s S -2 190 NA 0.00
#> 20 20 120 t T -1 200 FALSE -99.00
We also define a function to show the semi-join result arrayOp afl and download the data frame, for use in the examples below:
show = function(result_array_op){
print(result_array_op$to_afl())
result_array_op$to_df_all() # also download dimensions
}
Given an arrayOp object RefArray
and an R data frame or separate ArrayOp object df_or_arrayop
, the basic semi-join syntax is
RefArray$semi_join(df_or_arrayop,
field_mapping = NULL,
lower_bound = NULL,
upper_bound = NULL,
mode = "auto",
filter_threshold = 200L,
upload_threshold = 6000L)
The named arguments in the above template are their default values. The semi_join
operation can take one of several modes, and not every argument pertains to each mode.
df_or_arrayop
: the data frame or arrayOp of values to filter to; all columns other than the ones to be used as join keys for filtering should be removed or else automatic mode selection and the join logic may not behave as expected.mode
: the join algorithm to usefield_mapping
: a named list used in the ‘cross_between’ and ‘index_lookup’ modeslower_bound
: a named list used in the ‘cross_between’ and ‘filter’ modesupper_bound
: a named list used in the ‘cross_between’ and ‘filter’modesfilter_threshold
: a parameter used within automatic mode selection
upper_threshold
: a parameter used to determine whether how to pass the data frame df_or_arrayop
to SciDB. When the query data frame has cell count greater than filter_threshold
, but smaller than upload_threshold
, the data frame is converted to a build literal. Or if its greater than upload_threshold
, the data frame is uploaded as a persistent scidb array first.With the default setting of mode = 'auto'
, semi_join
follows the following rubric to determine the most appropriate join mode to use.
df_or_arrayop
is an R data frame and its number of cells is below filter_threshold
, use the ‘filter’ mode (this is the only use of the filter_threshold
argument); elseRefArray
, use the ‘cross_between’ mode; elseBe aware that sometimes the automatic mode selection may not pick the optimal mode, or even one that is compatible with the supplied values for the arguments field_mapping
, lower_bound
, and upper_bound
; if you are specifying any of these arguments it is best to pick the mode manually. For instance, if the supplied data frame is larger than the filter threshold and has multiple columns, one or more of which are not dimensions of RefArray
, automatic mode selection will pick equi_join
even if lower and upper bounds are given – which equi_join
cannot handle and so will throw an error.
In this mode, the join by the small data frame is unwrapped into a literal predicate.
RefArray$semi_join(
data.frame(da = c(3,5,8,11)), # number of cells <= filter_threshold
filter_threshold=10, upload_threshold=20) |>
show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,da=3 or da=5 or da=8 or da=11)"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 3 103 c C -18 30 FALSE NA
#> 2 5 105 e E -16 50 FALSE -99.00
#> 3 8 108 h H -13 80 FALSE NA
#> 4 11 111 k K -10 110 TRUE 3.14
# No matched cells
RefArray$semi_join(
data.frame(da = c(-10, -11, -12)),
filter_threshold=10, upload_threshold=20) |>
show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,da=-10 or da=-11 or da=-12)"
#> [1] da db lower upper f_int32 f_int64 f_bool f_double
#> <0 rows> (or 0-length row.names)
RefArray$semi_join(
data.frame(
f_int32 = c(-20, -17, -16),
lower = c("no_match", "d", "e")
),
filter_threshold = 10,
upload_threshold = 20
) |>
show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(f_int32=-20 and lower='no_match') or (f_int32=-17 and lower='d') or (f_int32=-16 and lower='e'))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 4 104 d D -17 40 NA 0
#> 2 5 105 e E -16 50 FALSE -99
RefArray$semi_join(
data.frame(da = c(-1, 4, 5),
lower = c("no_match", "d", "e")),
filter_threshold = 10,
upload_threshold = 20
) |>
show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(da=-1 and lower='no_match') or (da= 4 and lower='d') or (da= 5 and lower='e'))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 4 104 d D -17 40 NA 0
#> 2 5 105 e E -16 50 FALSE -99
The filter mode allows range joins where columns of df_or_arrayop
are the allowed lower or upper bounds for columns of the reference array. This requires the lower_bound
and upper_bound
arguments to be set, and optionally the field_mapping
argument.
RefArray$semi_join(
data.frame(da_low = c(1, 3), da_hi = c(5, 8)),
lower_bound = list('da' = 'da_low'),
upper_bound = list('da' = 'da_hi'),
filter_threshold = 10,
upload_threshold = 20
) |>
show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(da>=1 and da<=5) or (da>=3 and da<=8))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
#> 6 6 106 f F -15 60 TRUE 3.14
#> 7 7 107 g G -14 70 NA 2.00
#> 8 8 108 h H -13 80 FALSE NA
# Set lower/upper bounds to on different dimensions
RefArray$semi_join(
data.frame(da = c(1, 3), db = c(105, 108)),
lower_bound = list('da' = 'da'),
upper_bound = list('db' = 'db'),
filter_threshold = 10,
upload_threshold = 20
) |>
show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(da>=1 and db<=105) or (da>=3 and db<=108))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
#> 6 6 106 f F -15 60 TRUE 3.14
#> 7 7 107 g G -14 70 NA 2.00
#> 8 8 108 h H -13 80 FALSE NA
The cross_between
mode is available when the columns of the data frame (or attributes and dimensions of the arrayOp) are all dimensions of the reference array. This mode also, like filter, allows range joins where columns of df_or_arrayop
are the allowed lower or upper bounds for columns of the reference array.
Here, cross_between
is chosen because the 12 cells in the data frame are greater than the filter threshold, and because neither da
nor db
is an attribute of RefArray
:
RefArray$semi_join(
data.frame(da = c(1:5, -1), db = c(101:105, -1)),
filter_threshold = 10,
upload_threshold = 20
) |>
show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[luue_], '[( 1,101),( 2,102),( 3,103),( 4,104),( 5,105),(-1, -1)]', true),_da_low,int64(da),_da_high,int64(da),_db_low,int64(db),_db_high,int64(db)),_da_low,_db_low,_da_high,_db_high))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
Here, filter
is chosen instead since the data frame has only four cells:
RefArray$semi_join(
data.frame(da_low = c(1,3), da_hi = c(5, 8)),
lower_bound = list(da = 'da_low'),
upper_bound = list(da = 'da_hi'),
filter_threshold = 10,
upload_threshold = 20
) |>
show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(da>=1 and da<=5) or (da>=3 and da<=8))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
#> 6 6 106 f F -15 60 TRUE 3.14
#> 7 7 107 g G -14 70 NA 2.00
#> 8 8 108 h H -13 80 FALSE NA
Here, the query is already an arrayOp, and so the threshold is irrelevant; cross_between
mode is chosen since the query arrayOp has more than one attribute and neither da_low
nor da_hi
are attributes of RefArray
:
RefArray$semi_join(
conn$array_from_df(
data.frame(da_low = c(1, 3), da_hi = c(5, 8)),
"<da_low:int64, da_hi:int64> [anything]"
),
lower_bound = list(da = 'da_low'),
upper_bound = list(da = 'da_hi'),
) |>
show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da_low:int64,da_hi:int64>[DlIE_], '[(1,5),(3,8)]', true),_da_low,da_low,_da_high,da_hi,_db_low,-4611686018427387902,_db_high,4611686018427387903),_da_low,_db_low,_da_high,_db_high))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
#> 6 6 106 f F -15 60 TRUE 3.14
#> 7 7 107 g G -14 70 NA 2.00
#> 8 8 108 h H -13 80 FALSE NA
When some columns are to be used as numeric lower or upper bounds but happen to match the names of attributes of RefArray
, explicitly set field_mapping
to a named list that omits those columns:
RefArray$semi_join(
data.frame(da = c(1, 3), db = c(105, 108)),
lower_bound = list(da = 'da'),
upper_bound = list(db = 'db'),
field_mapping = list(),
filter_threshold = 2,
upload_threshold = 20
) |>
show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[EXOM_], '[(1,105),(3,108)]', true),_da_low,da,_da_high,4611686018427387903,_db_low,-4611686018427387902,_db_high,db),_da_low,_db_low,_da_high,_db_high))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
#> 6 6 106 f F -15 60 TRUE 3.14
#> 7 7 107 g G -14 70 NA 2.00
#> 8 8 108 h H -13 80 FALSE NA
RefArray$semi_join(
conn$array_from_df(
data.frame(da = c(1, 3), db = c(105, 108)),
template = RefArray
),
lower_bound = list(da = 'da'),
upper_bound = list(db = 'db'),
field_mapping = list()
) |>
show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[BqqE_], '[(1,105),(3,108)]', true),_da_low,da,_da_high,4611686018427387903,_db_low,-4611686018427387902,_db_high,db),_da_low,_db_low,_da_high,_db_high))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
#> 6 6 106 f F -15 60 TRUE 3.14
#> 7 7 107 g G -14 70 NA 2.00
#> 8 8 108 h H -13 80 FALSE NA
This mode is available when the query data is an R data frame with one column or an arrayOp with one attribute and one dimension (in which case the attribute is used for the join and dimension is ignored).
RefArray$semi_join(
data.frame(lower = letters[5:15]),
filter_threshold = 5,
upload_threshold = 20
) |> show()
#> [1] "project(filter(index_lookup(Rarrayop_37e37fd1a2bb,project(apply(build(<lower:string>[UrhF_], '[(\\'e\\'),(\\'f\\'),(\\'g\\'),(\\'h\\'),(\\'i\\'),(\\'j\\'),(\\'k\\'),(\\'l\\'),(\\'m\\'),(\\'n\\'),(\\'o\\')]', true),alt_lower_,lower),alt_lower_),lower,index_lower),index_lower is not null),lower,upper,f_int32,f_int64,f_bool,f_double)"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 5 105 e E -16 50 FALSE -99.00
#> 2 6 106 f F -15 60 TRUE 3.14
#> 3 7 107 g G -14 70 NA 2.00
#> 4 8 108 h H -13 80 FALSE NA
#> 5 9 109 i I -12 90 NA 0.00
#> 6 10 110 j J -11 100 FALSE -99.00
#> 7 11 111 k K -10 110 TRUE 3.14
#> 8 12 112 l L -9 120 NA 2.00
#> 9 13 113 m M -8 130 FALSE NA
#> 10 14 114 n N -7 140 NA 0.00
#> 11 15 115 o O -6 150 FALSE -99.00
RefArray$semi_join(
data.frame(da = 1:15),
filter_threshold = 5,
upload_threshold = 20
) |> show()
#> [1] "project(filter(index_lookup(apply(Rarrayop_37e37fd1a2bb,attr_da,da),project(apply(build(<da:int64>[Dvhi_], '[( 1),( 2),( 3),( 4),( 5),( 6),( 7),( 8),( 9),(10),(11),(12),(13),(14),(15)]', true),alt_da_,da),alt_da_),attr_da,index_da),index_da is not null),lower,upper,f_int32,f_int64,f_bool,f_double)"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
#> 6 6 106 f F -15 60 TRUE 3.14
#> 7 7 107 g G -14 70 NA 2.00
#> 8 8 108 h H -13 80 FALSE NA
#> 9 9 109 i I -12 90 NA 0.00
#> 10 10 110 j J -11 100 FALSE -99.00
#> 11 11 111 k K -10 110 TRUE 3.14
#> 12 12 112 l L -9 120 NA 2.00
#> 13 13 113 m M -8 130 FALSE NA
#> 14 14 114 n N -7 140 NA 0.00
#> 15 15 115 o O -6 150 FALSE -99.00
The equi_join
mode uses that SciDB operator to perform the filter on all matching attributes and/or dimensions, automatically determining on its own the particular equi_join
algorithm to use based on the array sizes. Since the output of equi_join
is materializing and has a different schema from RefArray
, semi_join
must redimension its output to have the expected dimensions (i.e. identical to the input RefArray
).
The equi_join
operator does not support range joins and will throw an error if field_mapping
, lower_bound
, or upper_bound
arguments are supplied – including in some cases where automatic mode selection fails because of unsatisfiable requirements.
The apply_join
mode is not current chosen by automatic mode selection, but can be used in place of equi_join
when the query data frame or array is not too large. At the cost of replicating this query data across all SciDB instances, its output is non-materializing, and does not need to be redimensioned to have the correct schema.
The apply_join
operator does not support range joins and will throw an error if field_mapping
, lower_bound
, or upper_bound
arguments are supplied
When df_or_arrayop
is an R data frame with more than upload_threshold
rows, a persistent array is created in SciDB. The reference to this array is contained in the resulting arrayOp instance. However, this reference isn’t passed along to additional chained operation that may be called. In order to prevent the query array from being garbage collected by R before the arrayOp is executed, you should hold onto a reference to the arrayOp that semi_join
returns.
# The following will work, due to only having a single
result_array <- RefArray$semi_join(
data.frame(da = c(1:5, -1), db = c(101:105, -1)),
filter_threshold = 10,
upload_threshold = 20)
result_array |> show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[YVqG_], '[( 1,101),( 2,102),( 3,103),( 4,104),( 5,105),(-1, -1)]', true),_da_low,int64(da),_da_high,int64(da),_db_low,int64(db),_db_high,int64(db)),_da_low,_db_low,_da_high,_db_high))"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
rm(result_array)
# The following _would_ fail if we didn't create `temp_arr_1` around
result_array <- temp_arr_1 <- RefArray$semi_join(
data.frame(da = c(1:5, -1), db = c(101:105, -1)),
filter_threshold = 10,
upload_threshold = 20)
# Because we are assigning the result of this second `semi_join` to the same result_array
# variable, the reference to the original uploaded query data goes out of scope, and
# could be garbage collected, resulting the an 'array not found' error when you
# finally execute the arrayOp chain, unless we still have a
# reference to it (with `temp_arr_1` above).
result_array <- result_array$semi_join(data.frame(da = c(1:5)),
filter_threshold = 10,
upload_threshold = 20)
result_array |> show()
#> [1] "filter(cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[fFIV_], '[( 1,101),( 2,102),( 3,103),( 4,104),( 5,105),(-1, -1)]', true),_da_low,int64(da),_da_high,int64(da),_db_low,int64(db),_db_high,int64(db)),_da_low,_db_low,_da_high,_db_high)),da=1 or da=2 or da=3 or da=4 or da=5)"
#> da db lower upper f_int32 f_int64 f_bool f_double
#> 1 1 101 a A -20 10 TRUE 3.14
#> 2 2 102 b B -19 20 NA 2.00
#> 3 3 103 c C -18 30 FALSE NA
#> 4 4 104 d D -17 40 NA 0.00
#> 5 5 105 e E -16 50 FALSE -99.00
rm(result_array)
rm(temp_arr_1)