R/array_op_base.R
ArrayOpBase.Rd
ArrayOp class instances denote scidb array operations and operands, hence the name.
Operands can be plain scidb array names or (potentially nested) operations on arrays.
Most ArrayOp class methods return a new ArrayOp instance and leave the instance on which they are invoked unchanged, i.e. ArrayOp instances are immutable.
One ArrayOp operation may involve one or more scidb operators and any number of operands, and its result can serve as an operand in another operation. Operands and operation results can all be denoted by ArrayOp.
Sub-classes of ArrayOpBase deal with any syntax or operator changes in different SciDB versions so that the ArrayOpBase class can provide a unified API on all supported SciDB versions. Currently SciDB V18 and V19 are supported.
Users of the arrayop package shouldn't be concerned with a specific sub-class since the ScidbConnection object automatically chooses the correct class version and creates instances based on the scidb version it connects to.
Get arrayOp instances from the default ScidbConnection object. See arrayop::get_default_connection for details.
dims
Dimension names
attrs
Attribute names
selected
Selected dimension and/or attribute names
dtypes
A named list, where key is dim/attr name and value is respective SciDB data type as string
raw_dtypes
A named list, where key is dim/attr name and value is first part of respective SciDB data type as string
dims_n_attrs
Dimension and attribute names
attrs_n_dims
Attribute and dimension names
is_schema_from_scidb
Whether the array schema is retrieved from SciDB or inferred locally in R
is_scidb_data_frame
Whether current array_op is a regular array or SciDB data frame (array with hidden dimensions; not to be confused with R data frames)
.private
For internal testing only. Do not access this field to avoid unintended consequences!!!
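The difference between dtypes and raw_dtypes can be sketched in plain R. This is only an illustration: it assumes that "first part" means the first space-delimited token of the full type string, and raw_dtype_of is a hypothetical helper that is not part of arrayop.

```r
# Hypothetical helper (not part of arrayop): keep the first space-delimited
# token of each full SciDB type string, e.g. "string NOT NULL" -> "string".
raw_dtype_of <- function(dtypes) {
  lapply(dtypes, function(full_type) {
    strsplit(full_type, " ", fixed = TRUE)[[1]][1]
  })
}

dtypes <- list(field_str = "string NOT NULL", field_int32 = "int32")
raw <- raw_dtype_of(dtypes)
str(raw)
```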
new()
Base class initialize function, to be called in sub-class internally.
Always use ScidbConnection to get array_op instances.
ArrayOpBase$new(
raw_afl,
dims = as.character(c()),
attrs = as.character(c()),
dtypes = list(),
dim_specs = list(),
...,
meta_list
)
raw_afl
AFL expression (array name or operations) as string
dims
A string vector used as dimension names
attrs
A string vector used as attribute names
dtypes
A named list of strings, where names are attribute names and
values are full scidb data types.
E.g. dtypes = list(field_str = "string NOT NULL", field_int32 = "int32")
dim_specs
A named list of strings, where names are dimension names and values are dimension specs.
E.g. dim_specs = list(da = "0:*:0:*", chrom = "1:24:0:1").
...
A named list of metadata items, where names are used as keys in private$set_meta and private$get_meta functions.
meta_list
A list that stores ArrayOp metadata, e.g. field types. If provided, other regular params are not allowed.
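To make the dim_specs strings easier to read, the sketch below splits one into its parts in plain R. It assumes the four colon-separated parts follow SciDB's usual lower:upper:overlap:chunk_length order; parse_dim_spec is a hypothetical helper and not part of arrayop.

```r
# Hypothetical helper (not part of arrayop): split a dimension spec string
# into its four colon-separated parts, kept as strings because "*" is allowed.
parse_dim_spec <- function(spec) {
  parts <- strsplit(spec, ":", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 4)
  names(parts) <- c("lower", "upper", "overlap", "chunk_length")
  as.list(parts)
}

dim_specs <- list(da = "0:*:0:*", chrom = "1:24:0:1")
parsed <- lapply(dim_specs, parse_dim_spec)
str(parsed$chrom)
```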
inherit_refs()
Add the references of another ArrayOpBase object to this object
To be used when creating a new ArrayOpBase object that is derived not only from self but from multiple other objects as well
filter()
Create a new ArrayOp instance with filter expressions
Similar to dplyr::filter, fields are not quoted.
Operators for any type of fields include ==, !=, %in%, %not_in%.
To test whether a field is null, use unary operators: is_null, not_null.
Special binary operators for string fields include: %contains%, %starts_with%, %ends_with%, %like%, where only %like% takes a regular expression and other operators escape any special characters in the right operand.
Operators for numeric fields include: >, <, >=, <=
...
Filter expression(s) in R syntax. These expression(s) are not evaluated in R but first captured then converted to scidb expressions with appropriate syntax.
.expr
A single R expression, or a list of R exprs, or NULL.
If provided, ... is ignored. Multiple exprs are joined by 'and'.
This param is useful when we want to pass an already captured R expression.
.validate_fields
Boolean, default TRUE, whether to validate fields in filter expressions. When set to TRUE, an error is thrown if invalid fields exist.
.regex_func
deprecated option
.ignore_case
deprecated option
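A few filter calls might look like the sketch below. The field names chrom, pos and gene are hypothetical, arr is assumed to be an ArrayOp instance obtained from a ScidbConnection, and the call forms for %in% (an R vector on the right) and not_null (function-call style) are assumptions based on the operator list above.

```r
# `arr` is assumed to be an ArrayOp with hypothetical fields
# chrom (string), pos (numeric) and gene (string).
filter_examples <- function(arr) {
  list(
    # Numeric comparisons on unquoted field names.
    by_range = arr$filter(pos >= 1000, pos < 2000),
    # Membership test; the c(...) right operand is an assumption.
    by_set = arr$filter(chrom %in% c("1", "2", "X")),
    # String operator; special characters in the right operand are escaped.
    by_prefix = arr$filter(gene %starts_with% "BRCA"),
    # Null test; the call form not_null(field) is an assumption.
    by_not_null = arr$filter(not_null(gene)),
    # Already captured expressions go through .expr and are joined by 'and'.
    by_expr = arr$filter(.expr = list(quote(pos > 1000), quote(chrom == "X")))
  )
}
```

Each element of the returned list is a new ArrayOp; arr itself is left unchanged.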
mutate()
Create a new ArrayOp instance with mutated fields
Similar to dplyr::mutate, fields of source (self) can be removed or added to the result arrayOp.
Any fields that are not in the mutate expressions remain unchanged.
...
Named R expressions. Names are field names in the result arrayOp and must not be empty.
Set field = NULL to remove existing fields. E.g. abcd = NULL, def = def removes field 'abcd' and keeps field 'def'.
Values are R expressions similar to the filter method.
E.g. a = b + 2, name = first + "-" + last, chrom = if(chrom == 'x') 23 else if(chrom == 'y') 24 else chrom
.dots
A list of SciDBR expressions, R expressions, or NULL. If provided, the ... param is ignored. Useful when a list of mutation expressions is already created and can be passed around.
.sync_schema
Whether to get the exact schema from scidb. Default TRUE will cause a scidb query to get the schema. Set to FALSE to avoid schema checking.
transmute()
Create a new ArrayOp instance with mutated fields
Similar to dplyr::transmute, only listed fields are retained in the result arrayOp.
NOTE: Any fields that are not in the mutate expressions will be discarded.
...
R expressions. Names are optional. For each named expression, the name is used as field name in the result arrayOp. Unnamed expressions must be existing field names, unquoted, which result in unchanged source fields of self.
Values are R expressions similar to the filter method.
E.g. a = b + 2, name = first + "-" + last, chrom = if(chrom == 'x') 23 else if(chrom == 'y') 24 else chrom
.dots
A named list of SciDBR expressions, R expressions, or NULL. If provided, the ... param is ignored. Useful when a list of mutation expressions is already created and can be passed around.
.sync_schema
Whether to get the exact schema from scidb. Default TRUE will cause a scidb query to get the schema. Set to FALSE to avoid schema checking.
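The contrast between mutate and transmute can be sketched as follows. The field names chrom, pos and gene are hypothetical, and arr is assumed to be an ArrayOp instance obtained from a ScidbConnection.

```r
# `arr` is assumed to be an ArrayOp with hypothetical fields chrom, pos, gene.
mutate_examples <- function(arr) {
  list(
    # mutate: add pos_end; chrom, pos and gene remain unchanged.
    added = arr$mutate(pos_end = pos + 100),
    # mutate: drop gene by setting it to NULL; other fields remain unchanged.
    removed = arr$mutate(gene = NULL),
    # transmute: keep chrom as is (unnamed, unquoted) plus the new pos_end;
    # every field not listed is discarded.
    kept_only = arr$transmute(chrom, pos_end = pos + 100)
  )
}
```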
mutate_by()
Create an ArrayOp instance with the same schema as self, but with cells taken from 'data_array' for the 'updated_fields'.
ArrayOpBase$mutate_by(
data_array,
keys = NULL,
updated_fields = NULL,
.redimension_setting = NULL,
.join_setting = NULL
)
data_array
An ArrayOp instance that has at least two overlapping fields with self.
keys
Field names in both self and data_array. Cell content of these fields is from the 'self' arrayOp rather than 'data_array'.
updated_fields
Field names in both self and data_array. Cell content of these fields is from the 'data_array', NOT 'self'.
.redimension_setting
A list of strings used as the settings of scidb 'redimension' operator. Only applicable when a 'redimension' is needed.
.join_setting
A list of strings used as the settings of scidb 'join' operator. Only applicable when a 'join' is needed.
inner_join()
Inner join two arrays: 'self' (left) and 'right'
Similar to dplyr::inner_join, the result arrayOp performs an inner join.
For both left and right arrays, only selected fields are included in the result arrayOp.
If no fields are selected, then all fields are treated as selected.
ArrayOpBase$inner_join(
right,
by.x = NULL,
by.y = NULL,
by = NULL,
left_alias = "_L",
right_alias = "_R",
join_mode = "equi_join",
settings = NULL
)
right
An arrayOp instance
by.x
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'.
by.y
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'.
by
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'. If not NULL, must be fields of both operands.
left_alias
Alias for left array to resolve potential conflicting fields in result
right_alias
Alias for right array to resolve potential conflicting fields in result
join_mode
String 'equi_join', 'apply_join', or 'cross_join'. The second always replicates the right-hand array to all instances, with the benefit of non-materializing result in scidb. The third requires the join keys to all be dimensions of both operands, more stringent than 'equi_join' but again with the benefit of non-materializing result in scidb.
settings
A named list as join settings. E.g. list(algorithm = "'hash_replicate_right'")
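A join that keeps only selected fields from each side might be written as in the sketch below. The field names sample_id, value and label are hypothetical, and passing field names to select as strings is an assumption, since the select signature is not spelled out in this document.

```r
# `left` and `right` are assumed to be ArrayOp instances that share the
# hypothetical field sample_id.
join_example <- function(left, right) {
  left$select("sample_id", "value")$inner_join(
    right$select("sample_id", "label"),
    by = "sample_id",          # must be a field of both operands
    join_mode = "equi_join"    # the documented default
  )
}
```

The other join methods (left_join, right_join, full_join) take the same right, by and alias arguments, so the same shape of call should apply to them.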
left_join()
Left join two arrays: 'self' (left) and 'right'
Similar to dplyr::left_join, the result arrayOp performs a left join.
For both left and right arrays, only selected fields are included in the result arrayOp.
If no fields are selected, then all fields are treated as selected.
ArrayOpBase$left_join(
right,
by.x = NULL,
by.y = NULL,
by = NULL,
left_alias = "_L",
right_alias = "_R",
join_mode = "equi_join",
settings = NULL
)
right
An arrayOp instance
by.x
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'.
by.y
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'.
by
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'. If not NULL, must be fields of both operands.
left_alias
Alias for left array to resolve potential conflicting fields in result
right_alias
Alias for right array to resolve potential conflicting fields in result
join_mode
String 'equi_join' or 'apply_join'. The second always replicates the right-hand array to all instances, with the benefit of non-materializing result in scidb.
settings
A named list as join settings. E.g. list(algorithm = "'hash_replicate_right'")
right_join()
Right join two arrays: 'self' (left) and 'right'
Similar to dplyr::right_join, the result arrayOp performs a right join.
For both left and right arrays, only selected fields are included in the result arrayOp.
If no fields are selected, then all fields are treated as selected.
ArrayOpBase$right_join(
right,
by.x = NULL,
by.y = NULL,
by = NULL,
left_alias = "_L",
right_alias = "_R",
settings = NULL
)
right
An arrayOp instance
by.x
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'.
by.y
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'.
by
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'. If not NULL, must be fields of both operands.
left_alias
Alias for left array to resolve potential conflicting fields in result
right_alias
Alias for right array to resolve potential conflicting fields in result
settings
A named list as join settings. E.g. list(algorithm = "'hash_replicate_right'")
full_join()
Full join two arrays: 'self' (left) and 'right'
Similar to dplyr::full_join, the result arrayOp performs a full join.
For both left and right arrays, only selected fields are included in the result arrayOp.
If no fields are selected, then all fields are treated as selected.
ArrayOpBase$full_join(
right,
by.x = NULL,
by.y = NULL,
by = NULL,
left_alias = "_L",
right_alias = "_R",
settings = NULL
)
right
An arrayOp instance
by.x
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'.
by.y
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'.
by
NULL or a string vector as join keys. If set to NULL, join keys are inferred as shared fields of 'left' and 'right'. If not NULL, must be fields of both operands.
left_alias
Alias for left array to resolve potential conflicting fields in result
right_alias
Alias for right array to resolve potential conflicting fields in result
settings
A named list as join settings. E.g. list(algorithm = "'hash_replicate_right'")
semi_join()
Return an arrayOp instance with the same schema as self and the content cells that match the cells of 'df_or_arrayop'.
Similar to dplyr::semi_join, the result has the same schema as the left operand 'self', with content filtered by 'df_or_arrayop'.
Params field_mapping, lower_bound and upper_bound, if provided, must be named lists, where the names are from the source array (i.e. self), and values are from the right operand df_or_arrayop.
ArrayOpBase$semi_join(
df_or_arrayop,
field_mapping = NULL,
lower_bound = NULL,
upper_bound = NULL,
mode = "auto",
filter_threshold = 200L,
upload_threshold = 6000L
)
df_or_arrayop
An R data frame or arrayOp instance.
field_mapping
NULL or a named list of strings. Only applicable when mode is 'index_lookup' or 'cross_between', ignored in other modes.
lower_bound
NULL or a named list of strings. Only applicable when mode is 'filter' or 'cross_between'. Names of the list are fields of self, and value strings are fields or columns of the df_or_arrayop which are treated as lower bound to matching fields rather than exact match.
In 'filter' mode, the self fields in lower_bound can be any numeric fields. In 'cross_between' mode, the self fields in lower_bound must be array dimensions.
upper_bound
NULL or a named list of strings. Only applicable when mode is 'filter' or 'cross_between'. Names of the list are fields of self, and value strings are fields or columns of the df_or_arrayop which are treated as upper bound to matching fields rather than exact match.
In 'filter' mode, the self fields in upper_bound can be any numeric fields. In 'cross_between' mode, the self fields in upper_bound must be array dimensions.
mode
String of 'filter', 'cross_between', 'index_lookup', 'equi_join' or 'auto'
filter_threshold
A number below which the 'filter' mode is used unless a mode other than 'auto' is provided.
upload_threshold
A number below which the 'df_or_arrayop' data frame is compiled into a build literal array; otherwise uploaded to scidb as a regular array. Only applicable when 'df_or_arrayop' is an R data frame.
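A range-based semi_join with an R data frame could look like the sketch below. The self field pos and the data frame columns start and end are hypothetical; arr is assumed to be an ArrayOp whose pos field is numeric, as required for bounds in 'filter' mode.

```r
# `arr` is assumed to be an ArrayOp with a hypothetical numeric field pos.
semi_join_by_range <- function(arr) {
  ranges <- data.frame(start = c(100, 5000), end = c(200, 6000))
  arr$semi_join(
    ranges,
    lower_bound = list(pos = "start"),  # name: field of self; value: df column
    upper_bound = list(pos = "end"),
    mode = "filter"
  )
}
```

The result keeps the schema of arr; only its cells are restricted by the rows of ranges.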
group_by()
Create a new arrayOp with 'group by' fields
The result arrayOp is identical to self except for the 'group_by' fields.
When called before the summarize function, the result arrayOp will be converted into a grouped_aggregate operation.
summarize()
Create a new arrayOp with aggregated fields
...
Aggregation expressions in R syntax. Names of expressions are optional. If provided, names will be the fields of the result arrayOp; otherwise field names are auto generated by scidb.
Same syntax as ... in 'filter' and 'mutate' functions.
.dots
a list of aggregation expressions. Similar to '.dots' in 'mutate' and 'transmute'.
set_auto_fields()
Create a new ArrayOp instance that has auto incremented fields and/or anti-collision fields according to a template arrayOp
If the dimension count, attribute count and data types match between the source (self) and target, then no redimension is performed; otherwise the source is redimensioned first.
Redimension mode requires that all target fields exist on the source, regardless of whether they are attributes or dimensions. Redimension mode does not check whether source data types match the target because automatic data conversion occurs within scidb where necessary/applicable.
ArrayOpBase$set_auto_fields(
target,
source_auto_increment = NULL,
target_auto_increment = NULL,
anti_collision_field = NULL,
join_setting = NULL,
source_anti_collision_dim_spec = NULL
)
target
A target ArrayOp the source data is written to.
source_auto_increment
A single named integer, a single string, or NULL. E.g. c(z=0) for field 'z' in the source (i.e. self) starting from 0; or a single string 'z', equivalent to c(z=0). If NULL, assume it to be the only dimension in self, normally an artificial dimension from a build literal or unpack operation.
target_auto_increment
A named number vector, a string vector, or NULL, where the name is a target field and the value is the starting index.
E.g. c(aid=0, bid=1) means to set auto fields 'aid', 'bid' according to the target fields of the same name.
If 'target' doesn't have a cell, then default values start from 0 and 1 for aid and bid, respectively.
A string vector c("aid", "bid") is equivalent to c(aid=0, bid=0).
NULL means treat all missing fields (absent in self but present in target) as 0-based auto increment fields.
Here the target_auto_increment param only affects the initial load when the field is still null in the target array.
anti_collision_field
A target dimension name which exists only to resolve cell collision (i.e. cells with the same dimension coordinate).
join_setting
NULL or a named list. When not NULL, it is converted to settings for the scidb equi_join operator, only applicable when anti_collision_field is not NULL.
source_anti_collision_dim_spec
NULL or a string.
If NULL, the dimension spec for the anti-collision dimension in source
(self) is taken from self's schema.
In rare cases, we need to set the dimension spec to control the chunk size
in the 'redimension' operation, e.g. source_anti_collision_dim_spec = "0:*:0:123456"
update()
Update the target array with self's content
Similar behavior to the scidb insert operator. Requires that the numbers of attributes and dimensions of self and target arrays match. Field names are irrelevant.
This function only returns an arrayOp with the update operation AFL encapsulated. No real action is performed in scidb until source$update(target)$execute() is called.
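Because update only builds the operation, the generated AFL can be inspected before anything runs. The sketch below assumes source and target are ArrayOp instances with matching attribute and dimension counts; run_update is a hypothetical wrapper.

```r
# `source` and `target` are assumed to be ArrayOp instances.
run_update <- function(source, target) {
  op <- source$update(target)  # builds the operation; nothing runs in scidb yet
  afl <- op$to_afl()           # inspect the AFL that would be executed
  message("About to run: ", afl)
  op$execute()                 # only now is the update performed in scidb
  invisible(afl)
}
```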
overwrite()
Overwrite the target array with self's content
Similar behavior to the scidb store operator. Requires that the numbers of attributes and dimensions of self and target arrays match. Field names are irrelevant.
This function only returns an arrayOp with the overwrite operation AFL encapsulated. No real action is performed in scidb until source$overwrite(target)$execute() is called.
Warning: Target's content will be erased and filled with self's content.
delete_cells()
Create a new ArrayOp instance that encapsulates a delete operation
Implemented by the scidb delete operator.
Operators for any type of fields include ==, !=, %in%, %not_in%.
To test whether a field is null, use unary operators: is_null, not_null.
Special binary operators for string fields include: %contains%, %starts_with%, %ends_with%, %like%, where only %like% takes a regular expression and other operators escape any special characters in the right operand.
Operators for numeric fields include: >, <, >=, <=
...
Filter expression(s) in R syntax. These expression(s) are not evaluated in R but first captured then converted to scidb expressions with appropriate syntax.
.expr
A single R expression, or a list of R exprs, or NULL.
If provided, ... is ignored. Multiple exprs are joined by 'and'.
This param is useful when we want to pass an already captured R expression.
.regex_func
deprecated option
.ignore_case
deprecated option
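The idea of expressions being "captured, not evaluated" can be illustrated in base R. This is a generic sketch of expression capture with substitute, not the package's actual implementation; capture_exprs is a hypothetical helper, and the conversion to scidb syntax is left out.

```r
# Hypothetical helper: capture the expressions passed in `...` without
# evaluating them, and return each one as text.
capture_exprs <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]
  vapply(exprs, function(e) paste(deparse(e), collapse = " "), character(1))
}

# pos and chrom do not exist as R variables, yet no error is raised,
# because the expressions are never evaluated in R.
captured <- capture_exprs(pos > 1000, chrom == "X")
print(captured)
```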
select()
Create a new ArrayOp instance with selected fields
NOTE: this does NOT change the to_afl output, but explicitly states which field(s) should be retained if used in a parent operation that changes its schema, e.g. inner_join, left_join, right_join and to_df.
The selected fields are passed on to derived ArrayOp instances.
In all join operations, if no field is explicitly selected, then all fields are assumed to be retained.
In to_df, if no field is explicitly selected, only the attributes are retrieved as data frame columns.
In to_df_all, if no field is explicitly selected, it is equivalent to selecting all dimensions and attributes.
persist()
Persist array operation as scidb array
If self is a persistent array and no save_array_name is provided, then self is returned.
Otherwise, save self's AFL as a new scidb array. This includes two cases:
self is persistent and save_array_name is provided, i.e. explicit persistence
self is array operation(s), then a new array is created regardless of save_array_name
From the user's perspective:
When we need to ensure a handle to a persistent array and do not care whether it is a new or existing array, we should leave out save_array_name to avoid unnecessary array copying. E.g. conn$array_from_df may return a build literal or uploaded persistent array, call conn$array_from_df(...)$persist() to ensure a persistent array.
When we need to back up an array, then provide a save_array_name explicitly.
Parameters .gc and .temp are only applicable when a new array is created.
save_array_name
NULL or String. The new array name to save self's AFL as. If NULL, the array name is randomly generated when a new array is created.
.temp
Boolean, default FALSE. Whether to create a temporary scidb array.
.gc
Boolean, default TRUE. Whether to remove the persisted scidb array once the encapsulating arrayOp goes out of scope in R. Set to FALSE if we need to keep the array indefinitely.
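The two usage patterns described above can be sketched as small wrappers. conn is assumed to be a ScidbConnection and arr an ArrayOp; ensure_persistent and backup_array are hypothetical helper names, and the array name passed in is up to the caller.

```r
# Pattern 1: only need a handle to some persistent array, new or existing.
# Leaving out save_array_name avoids copying an already persistent array.
ensure_persistent <- function(conn, df) {
  conn$array_from_df(df)$persist()
}

# Pattern 2: explicit backup under a chosen name. .gc = FALSE keeps the
# backup array after the R object is no longer referenced.
backup_array <- function(arr, backup_name) {
  arr$persist(save_array_name = backup_name, .gc = FALSE)
}
```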
change_schema()
Create a new ArrayOp instance whose schema is the same as the template.
This operation throws away any fields that do not exist in template while keeping self's data for the matching fields.
Implemented by the scidb redimension operator, but it allows for partial-fields match if strict=F.
template
an ArrayOp instance as the schema template.
strict
If TRUE (default), requires that self has all the template fields.
.setting
a string vector, where each item will be appended to the redimension operand. E.g. .setting = c('false', 'cells_per_chunk: 1234') ==> redimension(source, template, false, cells_per_chunk: 1234)
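How the .setting items end up in the AFL can be mimicked with plain string assembly. The sketch only reproduces the documented example output; redimension_afl is a hypothetical helper, and the package may build the string differently.

```r
# Hypothetical helper: append each setting item as an extra operand of
# the redimension operator.
redimension_afl <- function(source, template, setting = character()) {
  sprintf("redimension(%s)", paste(c(source, template, setting), collapse = ", "))
}

afl <- redimension_afl("source", "template", c("false", "cells_per_chunk: 1234"))
print(afl)
```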
drop_dims()
Create a new arrayOp by dropping dimensions of 'self'.
Use mode = 'unpack' to still keep an artificial dimension in the result arrayOp.
The dimension is 0-based, auto-incremented up until self$cell_count() - 1.
'unpack' mode is useful in taking advantage of this artificial dimension to auto populate other fields, e.g. in set_auto_fields.
Use mode = 'flatten' to return a scidb data frame which has no explicit dimensions.
In both modes, the attributes of the result arrayOp are self's attributes and dimensions.
ArrayOpBase$drop_dims(
mode = "unpack",
.chunk_size = NULL,
.unpack_dim = dbutils$random_field_name()
)
mode
String 'unpack' (default) or 'flatten'.
.chunk_size
NULL or an integer. Converted to the 'chunk_size' param in 'unpack' mode; and 'cells_per_chunk' in 'flatten' mode.
.unpack_dim
NULL (default) or string as the dimension if 'unpack' mode is chosen. NULL defaults to a random field name.
sync_schema()
Create a new arrayOp with the actual schema from SciDB, or return 'self' if self$is_schema_from_scidb == T.
Useful in confirming the schema of complex array operations. If the array schema is already retrieved from SciDB, then just return self.
spawn()
Create a new ArrayOp instance using 'self' as a template
This function is mainly for array schema string generation when we want to rename, add, and/or exclude certain fields of self, but still keep other unspecified fields unchanged.
Data types and dimension specs of existing fields are inherited from 'self' unless provided explicitly. New field data types default to NAs unless provided explicitly.
This function is normally used internally for arrayOp generation.
ArrayOpBase$spawn(
afl_str = "spawned array_op (as template only)",
renamed = NULL,
added = NULL,
excluded = NULL,
dtypes = NULL,
dim_specs = NULL
)
afl_str
An AFL expression. In case of using the spawned result as a schema template only, the afl_str does not need to be provided. Otherwise, it should conform with the actual resultant arrayOp instance, which is very rare.
renamed
A list of renamed fields where names are old fields and values are new field names.
added
New fields added to result arrayOp. String vector or NULL.
excluded
Fields excluded from self. String vector or NULL.
dtypes
NULL or a named list of data types for fields of the result arrayOp, where names are field names, values (strings) are data types.
dim_specs
NULL or a named list of array dimension specs, where names are dimension names, values (strings) are dimension specs in scidb format.
to_afl()
AFL string encapsulated by the self ArrayOp
AFL can be either a scidb array name or array operation(s) on array(s).
The ArrayOp instance may have 'selected' fields, but they are not reflected in the result; they only determine which fields are retained in to_df() calls.
to_schema_str()
Return a schema representation of the ArrayOp: <attr1:type1 [, attr2:type2 ...]> [dim1 [;dim2]]
Unless sync_schema() is called, the schema may be inferred locally in R to save round trips between R and SciDB server.
SciDB data frames have hidden dimensions that start with $
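The shape of the schema string can be reproduced from an attribute type list and a dimension spec list in plain R. This is an illustrative sketch only: schema_str is a hypothetical helper, the exact spacing and separators produced by to_schema_str may differ, and the inputs reuse the example values given for the new() parameters.

```r
# Hypothetical helper: "<attr:type, ...> [dim=spec; ...]"
schema_str <- function(attr_dtypes, dim_specs) {
  attrs <- paste(names(attr_dtypes), unlist(attr_dtypes), sep = ":", collapse = ", ")
  dims <- paste(names(dim_specs), unlist(dim_specs), sep = "=", collapse = "; ")
  sprintf("<%s> [%s]", attrs, dims)
}

schema <- schema_str(
  attr_dtypes = list(field_str = "string NOT NULL", field_int32 = "int32"),
  dim_specs = list(da = "0:*:0:*", chrom = "1:24:0:1")
)
print(schema)
```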
limit()
Create a new arrayOp that encapsulates AFL for the first n cells of 'self'
We still need to append a to_df call to download the result as a data frame.
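Previewing an array then takes two chained calls, as in this sketch. arr is assumed to be an ArrayOp instance, preview is a hypothetical wrapper, and passing n as the first positional argument of limit is an assumption.

```r
# `arr` is assumed to be an ArrayOp instance.
preview <- function(arr, n = 10) {
  # limit() only builds the AFL; to_df() downloads the cells as a data frame.
  arr$limit(n)$to_df()
}
```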
summarize_array()
Return a data frame of the summary of the 'self' array
Implemented by scidb 'summarize' operator
version()
Get an arrayOp instance that encapsulates a version snapshot of a persistent scidb array
The function does not perform a version check in scidb. It only constructs an arrayOp locally to represent a specific version. If a non-existent version_id is later used in scidb related operations, an error will be thrown by SciDB.
is_persistent()
Returns whether the current arrayOp instance encapsulates a persistent scidb array name that may or may not exist on the scidb server
No checking with scidb server is performed. Only validate the arrayOp's AFL with regex and see if it matches an array name. E.g. "myNamespace.myArray" or "myArrayInPublicNamespace".
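A purely local name check of this kind can be sketched with a regular expression. The pattern below is an assumption chosen to accept the two example names; the regex actually used by the package may differ, and looks_like_array_name is a hypothetical helper.

```r
# Hypothetical helper: TRUE when the AFL string is a bare array name,
# optionally prefixed with a namespace, rather than an operation.
looks_like_array_name <- function(afl) {
  grepl("^([A-Za-z_][A-Za-z0-9_]*\\.)?[A-Za-z_][A-Za-z0-9_]*$", afl)
}

looks_like_array_name("myNamespace.myArray")
looks_like_array_name("filter(myArray, a > 1)")
```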
exists_persistent_array()
Returns whether the current arrayOp instance encapsulates a persistent scidb array that exists on the scidb server
If current arrayOp encapsulates an array operation, then it returns FALSE without checking with scidb server.
array_meta_data()
Download the array meta as an R data frame
The array metadata is retrieved by executing the scidb 'show' operator in the array namespace and matching on the current array name. Array metadata include fields: "name", "uaid", "aid", "schema", "availability", "temporary", "namespace", "distribution", "etcomp"
remove_versions()
Remove array versions of self
Only applicable to persistent arrays. Warning: This function will be executed effectively in scidb without extra 'execute()' and cannot be undone.
remove_array()
Remove the array of self
Only applicable to persistent arrays. Warning: This function will be executed effectively in scidb without extra 'execute()' and cannot be undone.