You are currently browsing the monthly archive for January 2017.
Today we are pleased to release version 1.1.1 of xml2. xml2 makes it easy to read, create, and modify XML with R. You can install it with:
install.packages("xml2")
As well as fixing many bugs, this release:
- Makes it easier to create and modify XML
- Improves roundtrip support between XML and lists
- Adds support for XML validation and XSLT transformations.
You can see a full list of changes in the release notes. This is the first release maintained by Jim Hester.
Creating and modifying XML
xml2 has been overhauled with a set of methods to make generating and modifying XML easier:
- xml_new_root() can be used to create a new document and root node simultaneously.
library(xml2)
library(magrittr)  # provides %>%
xml_new_root("x") %>%
  xml_add_child("y") %>%
  xml_root()
#> {xml_document}
#> <x>
#> [1] <y/>
- New xml_set_text(), xml_set_name(), xml_set_attr(), and xml_set_attrs() make it easy to modify nodes within a pipeline.
x <- read_xml("<a> <b /> <c><b/></c> </a>")
x
#> {xml_document}
#> <a>
#> [1] <b/>
#> [2] <c>\n <b/>\n</c>
x %>%
  xml_find_all(".//b") %>%
  xml_set_name("banana") %>%
  xml_set_attr("oldname", "b")
x
#> {xml_document}
#> <a>
#> [1] <banana oldname="b"/>
#> [2] <c>\n <banana oldname="b"/>\n</c>
- New xml_add_parent() makes it easy to insert a node as the parent of an existing node.
- You can create more esoteric node types with xml_comment() (comments), xml_cdata() (CDATA nodes), and xml_dtd() (DTDs).
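The functions in the list above can be combined on a single small document. The following is a minimal sketch rather than an example from the release notes: the node names, comment text, and attribute values are made up for illustration, and it avoids the pipe so that it only needs xml2.

```r
library(xml2)

# Create a document with root <x> and add one child <y>.
# xml_add_child() returns the node that was added.
doc <- xml_new_root("x")
y <- xml_add_child(doc, "y")

# Insert a new <wrapper> node as the parent of <y> (hypothetical name).
xml_add_parent(y, "wrapper")

# Add a comment node and a CDATA node under the root.
xml_add_child(doc, xml_comment("generated for illustration"))
xml_add_child(doc, xml_cdata("<raw & unescaped>"))

# Modify the original <y> node in place.
xml_set_text(y, "hello")
xml_set_attr(y, "id", "1")

# Print the serialized document.
cat(as.character(doc))
```

Because xml2 nodes are modified by reference, y still points at the same element after it has been wrapped, so later calls such as xml_set_text() act on the node inside the new parent.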
Coercion to and from R Lists
xml2 1.1.1 improves support for converting to and from R lists, thanks in part to work by Peter Foley and Jenny Bryan. In particular, xml2 now supports preserving the root node name as well as saving all xml2 attributes as R attributes. These changes allow you to convert most XML documents to and from R lists with as_list() and as_xml_document() without loss of data.
x <- read_xml("<fruits><apple color = 'red' /></fruits>")
x
#> {xml_document}
#> <fruits>
#> [1] <apple color="red"/>
as_list(x)
#> $apple
#> list()
#> attr(,"color")
#> [1] "red"
as_xml_document(as_list(x))
#> {xml_document}
#> <apple color="red">
XML validation and xslt
xml2 1.1.1 also adds support for XML validation, thanks to Jeroen Ooms. Simply read the document and schema files and call xml_validate().
doc <- read_xml(system.file("extdata/order-doc.xml", package = "xml2"))
schema <- read_xml(system.file("extdata/order-schema.xml", package = "xml2"))
xml_validate(doc, schema)
#> [1] TRUE
#> attr(,"errors")
#> character(0)
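When validation fails, the result is FALSE and the "errors" attribute carries the messages. The sketch below illustrates this with a tiny inline schema instead of the packaged files; the order and qty element names are invented for the example, and the exact wording of the error messages comes from libxml2 and is not shown here.

```r
library(xml2)

# A minimal XSD (hypothetical): <order> must contain one integer <qty>.
schema <- read_xml('
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="qty" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>')

good <- read_xml("<order><qty>3</qty></order>")
bad <- read_xml("<order><qty>three</qty></order>")

ok_result <- xml_validate(good, schema)
bad_result <- xml_validate(bad, schema)

# Report the validation errors for the invalid document, if any.
if (!bad_result) {
  print(attr(bad_result, "errors"))
}
```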
Jeroen also released the first xml2 extension package, xslt, in conjunction with xml2 1.1.1. xslt lets you apply XSLT (Extensible Stylesheet Language Transformations) to XML documents, which is useful for transforming XML data into other formats such as HTML.
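As a rough illustration of what such a transformation looks like, the sketch below turns a small XML document into a list using xml_xslt() from the xslt package. The input document and the stylesheet are invented for this example, and it assumes the xslt package is installed.

```r
library(xml2)
library(xslt)

# Input document (hypothetical data).
doc <- read_xml("<fruits><apple color='red'/><pear color='green'/></fruits>")

# Stylesheet that emits one <li> per child element of <fruits>.
style <- read_xml('
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>
  <xsl:template match="/fruits">
    <ul>
      <xsl:for-each select="*">
        <li><xsl:value-of select="name()"/>: <xsl:value-of select="@color"/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>')

# Apply the stylesheet and pull out the text of each generated item.
out <- xml_xslt(doc, style)
items <- xml_text(xml_find_all(out, "//li"))
print(items)
```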
We’re happy to announce that version 0.5 of the sparklyr package is now available on CRAN. The new version comes with many improvements over the first release, including:
- Extended dplyr support by implementing do() and n_distinct().
- New functions including sdf_quantile(), ft_tokenizer() and ft_regex_tokenizer().
- Improved compatibility: sparklyr now respects the value of the 'na.action' R option and supports dim(), nrow() and ncol().
- Experimental support for Livy to enable clients, including RStudio, to connect remotely to Apache Spark.
- Improved connections by simplifying initialization and providing error diagnostics.
- Certified sparklyr, RStudio Server Pro and ShinyServer Pro with Cloudera.
- Updated spark.rstudio.com with new deployment examples and a sparklyr cheatsheet.
Additional changes and improvements can be found in the sparklyr NEWS file.
For questions or feedback, please feel free to open a sparklyr GitHub issue or a sparklyr Stack Overflow question.
Extended dplyr support
sparklyr 0.5 adds support for n_distinct() as a faster and more concise equivalent of length(unique(x)), and also adds support for do() as a convenient way to perform multiple serial computations over a group_by() operation:
library(sparklyr)
library(dplyr)  # provides group_by(), do() and %>%
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
by_cyl <- group_by(mtcars_tbl, cyl)
fit_sparklyr <- by_cyl %>%
  do(mod = ml_linear_regression(mpg ~ disp, data = .))
# display results
fit_sparklyr$mod
In this case, . represents a Spark DataFrame, which allows us to perform operations at scale (like this linear regression) for a small set of groups. However, since each group operation is performed sequentially, it is not recommended to use do() with a large number of groups. The code above performs multiple linear regressions with the following output:
[[1]]
Call: ml_linear_regression(mpg ~ disp, data = .)
Coefficients:
(Intercept) disp
19.081987419 0.003605119
[[2]]
Call: ml_linear_regression(mpg ~ disp, data = .)
Coefficients:
(Intercept) disp
40.8719553 -0.1351418
[[3]]
Call: ml_linear_regression(mpg ~ disp, data = .)
Coefficients:
(Intercept) disp
22.03279891 -0.01963409
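The equivalence between n_distinct() and length(unique(x)) mentioned earlier can be checked without a Spark connection. The sketch below uses plain dplyr on the local mtcars data frame; it is an illustration only, and it assumes the same summarise() call would be translated by sparklyr when applied to a Spark table such as mtcars_tbl.

```r
library(dplyr)

# On a local vector the two expressions give the same count.
n_cyl <- n_distinct(mtcars$cyl)
n_cyl_base <- length(unique(mtcars$cyl))

# n_distinct() is most useful inside summarise(), here per gear group.
by_gear <- mtcars %>%
  group_by(gear) %>%
  summarise(distinct_cyl = n_distinct(cyl))

print(n_cyl)
print(by_gear)
```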
It’s worth mentioning that while sparklyr provides comprehensive support for dplyr, dplyr is not strictly required while using sparklyr. For instance, one can make use of DBI without dplyr as follows:
library(sparklyr)
library(DBI)
sc <- spark_connect(master = "local")
sdf_copy_to(sc, iris)
dbGetQuery(sc, "SELECT * FROM iris LIMIT 4")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
New functions
The new sdf_quantile() function computes approximate quantiles (to some relative error), while the new ft_tokenizer() and ft_regex_tokenizer() functions split a string by white spaces or regex patterns.
For example, ft_tokenizer() can be used as follows:
library(sparklyr)
library(janeaustenr)
library(dplyr)
sc <- spark_connect(master = "local")
# Assumption: the input is the austen_books() data from janeaustenr,
# which matches the text and book columns in the output below.
austen_tbl <- copy_to(sc, austen_books())
austen_tbl %>%
  na.omit() %>%
  ft_tokenizer(input.col = "text", output.col = "tokens") %>%
  head(4)
Which produces the following output:
                   text                book     tokens
                  <chr>               <chr>     <list>
1 SENSE AND SENSIBILITY Sense & Sensibility <list [3]>
2                       Sense & Sensibility <list [1]>
3        by Jane Austen Sense & Sensibility <list [3]>
4                       Sense & Sensibility <list [1]>
Tokens can be further processed through, for instance, HashingTF.
Improved compatibility
‘na.action’ is a parameter accepted as part of the ‘ml.options’ argument, which defaults to getOption("na.action", "na.omit"). This allows sparklyr to match the behavior of R while processing NA records. For instance, the following linear model drops NA records appropriately:
library(sparklyr)
library(dplyr)
library(nycflights13)
sc <- spark_connect(master = "local")
# Copy the raw table; the model call applies the 'na.action' option itself.
flights_tbl <- copy_to(sc, flights)
ml_linear_regression(
  flights_tbl,
  response = "dep_delay",
  features = c("arr_delay", "arr_time"))
* Dropped 9430 rows with 'na.omit' (336776 => 327346)
Call: ml_linear_regression(flights_tbl, response = "dep_delay",
features = c("arr_delay", "arr_time"))
Coefficients:
(Intercept) arr_delay arr_time
6.1001212994 0.8210307947 0.0005284729
In addition, dim(), nrow() and ncol() are now supported against Spark DataFrames.
Livy connections
Livy, “An Open Source REST Service for Apache Spark (Apache License)”, is now available in sparklyr 0.5 as an experimental feature. Among many scenarios, this enables connections from the RStudio desktop to Apache Spark when Livy is available and correctly configured in the remote cluster.
Livy running locally
To work with Livy locally, sparklyr supports livy_install(), which installs Livy in your local environment, similar to spark_install(). Since Livy is a service to enable remote connections into Apache Spark, the service needs to be started with livy_service_start(). Once the service is running, spark_connect() needs to reference the running service and use method = "livy"; sparklyr can then be used as usual. A short example follows:
livy_install()
livy_service_start()
sc <- spark_connect(master = "http://localhost:8998", method = "livy")
copy_to(sc, iris)
spark_disconnect(sc)
livy_service_stop()
Livy running in HDInsight
Microsoft Azure supports Apache Spark clusters configured with Livy and protected with basic authentication in HDInsight clusters. To use sparklyr with HDInsight clusters through Livy, first create the HDInsight cluster with Spark support:
Creating Spark Cluster in Microsoft Azure HDInsight
Once the cluster is created, you can connect with sparklyr as follows:
library(sparklyr)
library(dplyr)
config <- livy_config(user = "admin", password = "password")
sc <- spark_connect(master = "https://dm.azurehdinsight.net/livy/",
                    method = "livy",
                    config = config)
copy_to(sc, iris)
From a desktop running RStudio, the remote connection looks like this:
Improved connections
sparklyr 0.5 no longer requires internet connectivity to download additional Apache Spark packages. This enables connections in secure clusters that do not have internet access or while on the go.
Some community members reported a generic “Ports file does not exists” error while connecting with sparklyr 0.4. In 0.5, we’ve deprecated the ports file and improved error reporting. For instance, the following invalid connection example throws a descriptive error that includes the spark-submit parameters and logging information to help troubleshoot connection issues.
> library(sparklyr)
> sc <- spark_connect(master = "local", config = list("sparklyr.gateway.port" = "0"))
Error in force(code) :
  Failed while connecting to sparklyr to port (0) for sessionid (5305): Gateway in port (0) did not respond.
    Path: /spark-1.6.2-bin-hadoop2.6/bin/spark-submit
    Parameters: --class, sparklyr.Backend, 'sparklyr-1.6-2.10.jar', 0, 5305

---- Output Log ----
16/12/12 12:42:35 INFO sparklyr: Session (5305) starting

---- Error Log ----
Additional technical details can be found in the sparklyr gateway socket pull request.
Cloudera certification
sparklyr 0.4, sparklyr 0.5, RStudio Server Pro 1.0 and ShinyServer Pro 1.5 went through Cloudera’s certification and are now certified with Cloudera. Among various benefits, authentication features like Kerberos have been tested and validated against secured clusters.
For more information see Cloudera’s partner listings.
We’re thrilled to officially introduce the newest product in RStudio’s product lineup: RStudio Connect.
You can download a free 45-day trial of it here.
RStudio Connect is a new publishing platform for all the work your teams do in R. It provides a single destination for your Shiny applications, R Markdown documents, interactive HTML widgets, static plots, and more.
RStudio Connect isn’t just for R users. Now anyone can interact with custom built analytical data products developed by R users without having to program in R themselves. Team members can receive updated reports built on the same models/forecasts which can be configured to be rebuilt and distributed on a scheduled basis. RStudio Connect is designed to bring the power of data science to your entire enterprise.
RStudio Connect empowers analysts to share and manage the content they’ve created in R. Users of the RStudio IDE can publish content to RStudio Connect with the click of a button and immediately be able to manage that content from a user-friendly web application: setting access controls and performance settings and viewing the logs of the associated R processes on the server.
RStudio Connect is on-premises software that you can install on a server behind your firewall, ensuring that your data and R applications never have to leave your organization’s control. We integrate with many enterprise authentication platforms including LDAP/Active Directory, Google OAuth, PAM, and proxied authentication. We also provide an option to use an internal username/password system complete with user self-sign-up.
RStudio Connect has been in Beta for almost a year. We’ve had hundreds of customers validate and help us improve the software in that time. In November, we made RStudio Connect generally available without significant fanfare and began to work with Beta participants and existing RStudio customers eager to move it into their production environments. We are pleased that innovative early customers, like AdRoll, have already successfully introduced RStudio Connect into their data science process.
“At AdRoll, we have used the open source version of Shiny Server for years to great success but deploying apps always served as a barrier for new users. With RStudio Connect’s push button deployment from the RStudio IDE, the number of shiny devs has grown tremendously both in engineering and across teams completely new to shiny like finance and marketing. It’s been really powerful for those just getting started to be able to go from developing locally to sharing apps with others in just seconds.”
– Bryan Galvin, Senior Data Scientist, AdRoll
We invite you to take a look at RStudio Connect today, too!
You can find more details or download a 45 day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.