Semi Join

Overview

The purpose of the semi_join operator is, given an arrayOp RefArray, to return an arrayOp with the same schema as RefArray but with content filtered to those records matching a supplied R data from or second arrayOp instance on some subset of its dimensions and/or attributes.

Setup for examples

We first create a RefArray with content from ArrayContent data frame:

ArrayContent = data.frame(
  da=1:20, db=101:120,
  lower = letters[1:20], 
  upper = LETTERS[1:20],
  f_int32 = -20:-1, 
  f_int64 = 1:20 * 10.0, 
  f_bool = c(T,NA,F,NA,F), 
  f_double = c(3.14, 2.0, NA, 0, -99)
)

RefArray = conn$
  array_from_df(ArrayContent, 
  template =  "<lower:string COMPRESSION 'zlib', upper:string, f_int32:int32, f_int64:int64, f_bool: bool, f_double: double> 
      [da=0:*:0:*; db=0:*:0:*]", 
  force_template_schema = T)$ # Ensure RefArray has the exact schema as the template
  persist(.gc = FALSE)

RefArray$to_afl()
#> [1] "Rarrayop_37e37fd1a2bb"
RefArray$to_df_all()
#>    da  db lower upper f_int32 f_int64 f_bool f_double
#> 1   1 101     a     A     -20      10   TRUE     3.14
#> 2   2 102     b     B     -19      20     NA     2.00
#> 3   3 103     c     C     -18      30  FALSE       NA
#> 4   4 104     d     D     -17      40     NA     0.00
#> 5   5 105     e     E     -16      50  FALSE   -99.00
#> 6   6 106     f     F     -15      60   TRUE     3.14
#> 7   7 107     g     G     -14      70     NA     2.00
#> 8   8 108     h     H     -13      80  FALSE       NA
#> 9   9 109     i     I     -12      90     NA     0.00
#> 10 10 110     j     J     -11     100  FALSE   -99.00
#> 11 11 111     k     K     -10     110   TRUE     3.14
#> 12 12 112     l     L      -9     120     NA     2.00
#> 13 13 113     m     M      -8     130  FALSE       NA
#> 14 14 114     n     N      -7     140     NA     0.00
#> 15 15 115     o     O      -6     150  FALSE   -99.00
#> 16 16 116     p     P      -5     160   TRUE     3.14
#> 17 17 117     q     Q      -4     170     NA     2.00
#> 18 18 118     r     R      -3     180  FALSE       NA
#> 19 19 119     s     S      -2     190     NA     0.00
#> 20 20 120     t     T      -1     200  FALSE   -99.00

We also define a function to show the semi-join result arrayOp afl and download the data frame, for use in the examples below:

show = function(result_array_op){
  print(result_array_op$to_afl())
  result_array_op$to_df_all() # also download dimensions
}

Basic usage

Given an arrayOp object RefArray and an R data frame or separate ArrayOp object df_or_arrayop, the basic semi-join syntax is

RefArray$semi_join(df_or_arrayop, 
                   field_mapping = NULL,
                   lower_bound = NULL,
                   upper_bound = NULL,
                   mode = "auto",
                   filter_threshold = 200L,
                   upload_threshold = 6000L)

The named arguments in the above template are their default values. The semi_join operation can take one of several modes, and not every argument pertains to each mode.

df_or_arrayop: the data frame or arrayOp of values to filter to; all columns other than the ones to be used as join keys for filtering should be removed or else automatic mode selection and the join logic may not behave as expected.
mode: the join algorithm to use
field_mapping: a named list used in the ‘cross_between’ and ‘index_lookup’ modes
lower_bound: a named list used in the ‘cross_between’ and ‘filter’ modes
upper_bound: a named list used in the ‘cross_between’ and ‘filter’modes
filter_threshold: a parameter used within automatic mode selection
upper_threshold: a parameter used to determine whether how to pass the data frame df_or_arrayop to SciDB. When the query data frame has cell count greater than filter_threshold, but smaller than upload_threshold, the data frame is converted to a build literal. Or if its greater than upload_threshold, the data frame is uploaded as a persistent scidb array first.

Automatic mode selection

With the default setting of mode = 'auto', semi_join follows the following rubric to determine the most appropriate join mode to use.

If df_or_arrayop is an R data frame and its number of cells is below filter_threshold, use the ‘filter’ mode (this is the only use of the filter_threshold argument); else
If it is an R data frame with one column or arrayOp object with one attribute and one dimension, use the ‘index_lookup’ mode; else
If all its columns, or dimensions and attributes, are dimensions of RefArray, use the ‘cross_between’ mode; else
Use the ‘equi_join’ mode

Be aware that sometimes the automatic mode selection may not pick the optimal mode, or even one that is compatible with the supplied values for the arguments field_mapping, lower_bound, and upper_bound; if you are specifying any of these arguments it is best to pick the mode manually. For instance, if the supplied data frame is larger than the filter threshold and has multiple columns, one or more of which are not dimensions of RefArray, automatic mode selection will pick equi_join even if lower and upper bounds are given – which equi_join cannot handle and so will throw an error.

Available modes

Filter mode

In this mode, the join by the small data frame is unwrapped into a literal predicate.

By one field

RefArray$semi_join(
  data.frame(da = c(3,5,8,11)), # number of cells <= filter_threshold
  filter_threshold=10, upload_threshold=20) |>
  show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,da=3 or da=5 or da=8 or da=11)"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  3 103     c     C     -18      30  FALSE       NA
#> 2  5 105     e     E     -16      50  FALSE   -99.00
#> 3  8 108     h     H     -13      80  FALSE       NA
#> 4 11 111     k     K     -10     110   TRUE     3.14

# No matched cells
RefArray$semi_join(
  data.frame(da =  c(-10, -11, -12)),
  filter_threshold=10, upload_threshold=20) |>
  show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,da=-10 or da=-11 or da=-12)"
#> [1] da       db       lower    upper    f_int32  f_int64  f_bool   f_double
#> <0 rows> (or 0-length row.names)

By two fields

RefArray$semi_join(
  data.frame(
    f_int32 =  c(-20, -17, -16),
    lower = c("no_match", "d", "e")
  ),
  filter_threshold = 10,
  upload_threshold = 20
) |>
  show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(f_int32=-20 and lower='no_match') or (f_int32=-17 and lower='d') or (f_int32=-16 and lower='e'))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  4 104     d     D     -17      40     NA        0
#> 2  5 105     e     E     -16      50  FALSE      -99


RefArray$semi_join(
  data.frame(da =  c(-1, 4, 5),
             lower = c("no_match", "d", "e")),
  filter_threshold = 10,
  upload_threshold = 20
) |>
  show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(da=-1 and lower='no_match') or (da= 4 and lower='d') or (da= 5 and lower='e'))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  4 104     d     D     -17      40     NA        0
#> 2  5 105     e     E     -16      50  FALSE      -99

By lower/upper bounds to range query array dimensions

The filter mode allows range joins where columns of df_or_arrayop are the allowed lower or upper bounds for columns of the reference array. This requires the lower_bound and upper_bound arguments to be set, and optionally the field_mapping argument.

RefArray$semi_join(
  data.frame(da_low = c(1, 3), da_hi = c(5, 8)),
  lower_bound = list('da' = 'da_low'),
  upper_bound = list('da' = 'da_hi'),
  filter_threshold = 10,
  upload_threshold = 20
) |>
  show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(da>=1 and da<=5) or (da>=3 and da<=8))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00
#> 6  6 106     f     F     -15      60   TRUE     3.14
#> 7  7 107     g     G     -14      70     NA     2.00
#> 8  8 108     h     H     -13      80  FALSE       NA

# Set lower/upper bounds to on different dimensions
RefArray$semi_join(
  data.frame(da = c(1, 3), db = c(105, 108)),
  lower_bound = list('da' = 'da'),
  upper_bound = list('db' = 'db'),
  filter_threshold = 10,
  upload_threshold = 20
) |>
  show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(da>=1 and db<=105) or (da>=3 and db<=108))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00
#> 6  6 106     f     F     -15      60   TRUE     3.14
#> 7  7 107     g     G     -14      70     NA     2.00
#> 8  8 108     h     H     -13      80  FALSE       NA

cross_between mode

The cross_between mode is available when the columns of the data frame (or attributes and dimensions of the arrayOp) are all dimensions of the reference array. This mode also, like filter, allows range joins where columns of df_or_arrayop are the allowed lower or upper bounds for columns of the reference array.

Here, cross_between is chosen because the 12 cells in the data frame are greater than the filter threshold, and because neither da nor db is an attribute of RefArray:

RefArray$semi_join(
  data.frame(da = c(1:5, -1), db = c(101:105, -1)),
  filter_threshold = 10,
  upload_threshold = 20
) |>
  show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[luue_], '[( 1,101),( 2,102),( 3,103),( 4,104),( 5,105),(-1, -1)]', true),_da_low,int64(da),_da_high,int64(da),_db_low,int64(db),_db_high,int64(db)),_da_low,_db_low,_da_high,_db_high))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00

Here, filter is chosen instead since the data frame has only four cells:

RefArray$semi_join(
  data.frame(da_low = c(1,3), da_hi = c(5, 8)),
  lower_bound = list(da = 'da_low'), 
  upper_bound = list(da = 'da_hi'),
  filter_threshold = 10,
  upload_threshold = 20
) |>
  show()
#> [1] "filter(Rarrayop_37e37fd1a2bb,(da>=1 and da<=5) or (da>=3 and da<=8))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00
#> 6  6 106     f     F     -15      60   TRUE     3.14
#> 7  7 107     g     G     -14      70     NA     2.00
#> 8  8 108     h     H     -13      80  FALSE       NA

Here, the query is already an arrayOp, and so the threshold is irrelevant; cross_between mode is chosen since the query arrayOp has more than one attribute and neither da_low nor da_hi are attributes of RefArray:

RefArray$semi_join(
  conn$array_from_df(
    data.frame(da_low = c(1, 3), da_hi = c(5, 8)),
    "<da_low:int64, da_hi:int64> [anything]"
  ),
  lower_bound = list(da = 'da_low'),
  upper_bound = list(da = 'da_hi'),
) |>
  show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da_low:int64,da_hi:int64>[DlIE_], '[(1,5),(3,8)]', true),_da_low,da_low,_da_high,da_hi,_db_low,-4611686018427387902,_db_high,4611686018427387903),_da_low,_db_low,_da_high,_db_high))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00
#> 6  6 106     f     F     -15      60   TRUE     3.14
#> 7  7 107     g     G     -14      70     NA     2.00
#> 8  8 108     h     H     -13      80  FALSE       NA

When some columns are to be used as numeric lower or upper bounds but happen to match the names of attributes of RefArray, explicitly set field_mapping to a named list that omits those columns:

RefArray$semi_join(
  data.frame(da = c(1, 3), db = c(105, 108)),
  lower_bound = list(da = 'da'),
  upper_bound = list(db = 'db'),
  field_mapping = list(),
  filter_threshold = 2,
  upload_threshold = 20
) |>
  show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[EXOM_], '[(1,105),(3,108)]', true),_da_low,da,_da_high,4611686018427387903,_db_low,-4611686018427387902,_db_high,db),_da_low,_db_low,_da_high,_db_high))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00
#> 6  6 106     f     F     -15      60   TRUE     3.14
#> 7  7 107     g     G     -14      70     NA     2.00
#> 8  8 108     h     H     -13      80  FALSE       NA

RefArray$semi_join(
  conn$array_from_df(
    data.frame(da = c(1, 3), db = c(105, 108)),
    template = RefArray
  ),
  lower_bound = list(da = 'da'),
  upper_bound = list(db = 'db'),
  field_mapping = list()
) |>
  show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[BqqE_], '[(1,105),(3,108)]', true),_da_low,da,_da_high,4611686018427387903,_db_low,-4611686018427387902,_db_high,db),_da_low,_db_low,_da_high,_db_high))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00
#> 6  6 106     f     F     -15      60   TRUE     3.14
#> 7  7 107     g     G     -14      70     NA     2.00
#> 8  8 108     h     H     -13      80  FALSE       NA

index_lookup mode

This mode is available when the query data is an R data frame with one column or an arrayOp with one attribute and one dimension (in which case the attribute is used for the join and dimension is ignored).

By attribute:

RefArray$semi_join(
  data.frame(lower = letters[5:15]),
  filter_threshold = 5,
  upload_threshold = 20
) |> show()
#> [1] "project(filter(index_lookup(Rarrayop_37e37fd1a2bb,project(apply(build(<lower:string>[UrhF_], '[(\\'e\\'),(\\'f\\'),(\\'g\\'),(\\'h\\'),(\\'i\\'),(\\'j\\'),(\\'k\\'),(\\'l\\'),(\\'m\\'),(\\'n\\'),(\\'o\\')]', true),alt_lower_,lower),alt_lower_),lower,index_lower),index_lower is not null),lower,upper,f_int32,f_int64,f_bool,f_double)"
#>    da  db lower upper f_int32 f_int64 f_bool f_double
#> 1   5 105     e     E     -16      50  FALSE   -99.00
#> 2   6 106     f     F     -15      60   TRUE     3.14
#> 3   7 107     g     G     -14      70     NA     2.00
#> 4   8 108     h     H     -13      80  FALSE       NA
#> 5   9 109     i     I     -12      90     NA     0.00
#> 6  10 110     j     J     -11     100  FALSE   -99.00
#> 7  11 111     k     K     -10     110   TRUE     3.14
#> 8  12 112     l     L      -9     120     NA     2.00
#> 9  13 113     m     M      -8     130  FALSE       NA
#> 10 14 114     n     N      -7     140     NA     0.00
#> 11 15 115     o     O      -6     150  FALSE   -99.00

By dimension:

RefArray$semi_join(
  data.frame(da = 1:15),
  filter_threshold = 5,
  upload_threshold = 20
) |> show()
#> [1] "project(filter(index_lookup(apply(Rarrayop_37e37fd1a2bb,attr_da,da),project(apply(build(<da:int64>[Dvhi_], '[( 1),( 2),( 3),( 4),( 5),( 6),( 7),( 8),( 9),(10),(11),(12),(13),(14),(15)]', true),alt_da_,da),alt_da_),attr_da,index_da),index_da is not null),lower,upper,f_int32,f_int64,f_bool,f_double)"
#>    da  db lower upper f_int32 f_int64 f_bool f_double
#> 1   1 101     a     A     -20      10   TRUE     3.14
#> 2   2 102     b     B     -19      20     NA     2.00
#> 3   3 103     c     C     -18      30  FALSE       NA
#> 4   4 104     d     D     -17      40     NA     0.00
#> 5   5 105     e     E     -16      50  FALSE   -99.00
#> 6   6 106     f     F     -15      60   TRUE     3.14
#> 7   7 107     g     G     -14      70     NA     2.00
#> 8   8 108     h     H     -13      80  FALSE       NA
#> 9   9 109     i     I     -12      90     NA     0.00
#> 10 10 110     j     J     -11     100  FALSE   -99.00
#> 11 11 111     k     K     -10     110   TRUE     3.14
#> 12 12 112     l     L      -9     120     NA     2.00
#> 13 13 113     m     M      -8     130  FALSE       NA
#> 14 14 114     n     N      -7     140     NA     0.00
#> 15 15 115     o     O      -6     150  FALSE   -99.00

equi_join mode

The equi_join mode uses that SciDB operator to perform the filter on all matching attributes and/or dimensions, automatically determining on its own the particular equi_join algorithm to use based on the array sizes. Since the output of equi_join is materializing and has a different schema from RefArray, semi_join must redimension its output to have the expected dimensions (i.e. identical to the input RefArray).

The equi_join operator does not support range joins and will throw an error if field_mapping, lower_bound, or upper_bound arguments are supplied – including in some cases where automatic mode selection fails because of unsatisfiable requirements.

apply_join mode

The apply_join mode is not current chosen by automatic mode selection, but can be used in place of equi_join when the query data frame or array is not too large. At the cost of replicating this query data across all SciDB instances, its output is non-materializing, and does not need to be redimensioned to have the correct schema.

The apply_join operator does not support range joins and will throw an error if field_mapping, lower_bound, or upper_bound arguments are supplied

Uploading R data frames and temporary arrays

When df_or_arrayop is an R data frame with more than upload_threshold rows, a persistent array is created in SciDB. The reference to this array is contained in the resulting arrayOp instance. However, this reference isn’t passed along to additional chained operation that may be called. In order to prevent the query array from being garbage collected by R before the arrayOp is executed, you should hold onto a reference to the arrayOp that semi_join returns.

# The following will work, due to only having a single 
result_array <- RefArray$semi_join(
                  data.frame(da = c(1:5, -1), db = c(101:105, -1)),
                  filter_threshold = 10,
                  upload_threshold = 20)

result_array |> show()
#> [1] "cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[YVqG_], '[( 1,101),( 2,102),( 3,103),( 4,104),( 5,105),(-1, -1)]', true),_da_low,int64(da),_da_high,int64(da),_db_low,int64(db),_db_high,int64(db)),_da_low,_db_low,_da_high,_db_high))"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00
rm(result_array)

# The following _would_ fail if we didn't create `temp_arr_1` around 
result_array <- temp_arr_1 <- RefArray$semi_join(
                                data.frame(da = c(1:5, -1), db = c(101:105, -1)),
                                filter_threshold = 10,
                                upload_threshold = 20)

# Because we are assigning the result of this second `semi_join` to the same result_array
#   variable, the reference to the original uploaded query data goes out of scope, and
#   could be garbage collected, resulting the an 'array not found' error when you 
#   finally execute the arrayOp chain, unless we still have a 
#   reference to it (with `temp_arr_1` above). 
result_array <- result_array$semi_join(data.frame(da = c(1:5)),
                                       filter_threshold = 10,
                                       upload_threshold = 20)
result_array |> show()
#> [1] "filter(cross_between(Rarrayop_37e37fd1a2bb,project(apply(build(<da:int64,db:int64>[fFIV_], '[( 1,101),( 2,102),( 3,103),( 4,104),( 5,105),(-1, -1)]', true),_da_low,int64(da),_da_high,int64(da),_db_low,int64(db),_db_high,int64(db)),_da_low,_db_low,_da_high,_db_high)),da=1 or da=2 or da=3 or da=4 or da=5)"
#>   da  db lower upper f_int32 f_int64 f_bool f_double
#> 1  1 101     a     A     -20      10   TRUE     3.14
#> 2  2 102     b     B     -19      20     NA     2.00
#> 3  3 103     c     C     -18      30  FALSE       NA
#> 4  4 104     d     D     -17      40     NA     0.00
#> 5  5 105     e     E     -16      50  FALSE   -99.00
rm(result_array)
rm(temp_arr_1)

Paradigm4

2025-04-10