Data W(e)ave On The Beach: The merge function

Most dynamically typed languages offer a way to merge two objects: every field of an object o2 is copied into a source object o1. DataWeave has a peculiarity here: a single object may contain several fields that share the same name. For example:

%dw 2.0
output application/json
var p = {
    "message": "Hello monde!",
    "message": "Hello world!"
}
---
// ".*" is the multi-value selector: it returns every value stored under "message"
p.*message

which results in:

[
  "Hello monde!",
  "Hello world!"
]
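To make the difference between selectors concrete, here is a minimal sketch that reads the same object with the single-value selector and with the multi-value selector. The output field names `first` and `all` are chosen for illustration only.

```dataweave
%dw 2.0
output application/json
var p = {
    "message": "Hello monde!",
    "message": "Hello world!"
}
---
{
    // Single-value selector: stops at the first matching key
    first: p.message,
    // Multi-value selector: collects the values of every matching key
    all: p.*message
}
```

With the single-value selector, only "Hello monde!" is returned; the second "message" field is silently ignored.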

We'll see how this can cause problems. Let's take the case of an object describing a person and their address, which we want to update. Our first instinct is to use the “++” operator, which, on paper, performs this merge:

%dw 2.0
output application/json
var person = {
    "firstname": "Harry",
    "lastname": "Torkovsky",
    "address1": "3 Hotton Street",
    "address2": "Somewhere",
    "postalcode": "1234"
}
---
person ++ {
    "address1": "2 Avenue Hermonsa",
    "address2": "Elsewhere",
    "postalcode": "4321",
    "phone": "06872354"
}

The result is a little surprising:

{
  "firstname": "Harry",
  "lastname": "Torkovsky",
  "address1": "3 Hotton Street",
  "address2": "Somewhere",
  "postalcode": "1234",
  "address1": "2 Avenue Hermonsa",
  "address2": "Elsewhere",
  "postalcode": "4321",
  "phone": "06872354"
}
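To see why these duplicate keys matter in practice, the sketch below (a shortened version of the same person object, with illustrative output field names) reads the address back after the "++". Because the single-value selector returns the first match, it still yields the old address.

```dataweave
%dw 2.0
output application/json
var person = { "firstname": "Harry", "address1": "3 Hotton Street" }
var merged = person ++ { "address1": "2 Avenue Hermonsa" }
---
{
    // First match wins: this is still the OLD address
    address1: merged.address1,
    // Both values are present in the merged object
    allAddress1: merged.*address1
}
```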

This is clearly not the desired behavior: we wanted to replace the address fields, not duplicate them. The most obvious fix is the “--” operator, which removes a given list of keys from an object. Our processing therefore starts by removing the fields the two objects share, then appends the new versions:

%dw 2.0
output application/json
var person = {
    "firstname": "Harry",
    "lastname": "Torkovsky",
    "address1": "3 Hotton Street",
    "address2": "Somewhere",
    "postalcode": "1234"
}
var repl = {
    "address1": "2 Avenue Hermonsa",
    "address2": "Elsewhere",
    "postalcode": "4321",
    "phone": "06872354"
}
---
// Remove from person every key present in repl, then append repl
(person -- keysOf(repl)) ++ repl

This time, the result is as expected:

{
  "firstname": "Harry",
  "lastname": "Torkovsky",
  "address1": "2 Avenue Hermonsa",
  "address2": "Elsewhere",
  "postalcode": "4321",
  "phone": "06872354"
}

Problem solved? Well... not quite. If you've been following this blog, you know that the “--” operator has O(N²) complexity and can cause performance problems. So what happens when the objects to merge are large, or even very large, say with thousands of fields? To find out, I wrote two implementations of the “merge” method. The first uses “--”; the second rebuilds the object by filtering (see the previous post for more details on these two approaches). Let's compare them:

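%dw 2.8
output application/json
// Current time in milliseconds since the epoch
fun Now() = now() then (t) -> (t as Number) * 1000 + t.milliseconds
// Runs the lambda prs, logs the elapsed time in ms, and returns its result
fun eval(prs) =
    Now() then (t1) -> prs() then (r) ->
    log(Now() - t1)
    then r
// Strategy 1: remove the keys of b from a, then append b
fun merge0(a, b) = (a -- keysOf(b)) ++ b
// Strategy 2: turn both objects into {k, v} entries, keep the entries of a
// whose key has no non-empty value in b, then rebuild a single object.
// Note: isEmpty also treats null, "" and empty collections in b as absent.
fun merge1(a, b) = (
    (a pluck {k: $$, v: $}
        filter isEmpty(b[$.k])
    ) ++ (b pluck {k: $$, v: $})
)
reduce (l, acc = {}) -> acc ++ {(l.k): l.v}
---
[
    eval(() ->
        merge0(vars.obj, vars.obj)
    ),
    eval(() ->
        merge1(vars.obj, vars.obj)
    )
]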

vars.obj is a Java object (i.e. a HashMap) containing 25,000 fields. “eval” executes a lambda and logs the time it took, in milliseconds. Here is the result; the test was run four times, and each run logs two lines, first for merge0, then for merge1:

INFO  ... DefaultLoggingService$: 30543
INFO  ... DefaultLoggingService$: 651
INFO  ... DefaultLoggingService$: 29536
INFO  ... DefaultLoggingService$: 247
INFO  ... DefaultLoggingService$: 28971
INFO  ... DefaultLoggingService$: 115
INFO  ... DefaultLoggingService$: 30943
INFO  ... DefaultLoggingService$: 168

About 30 seconds for the implementation that uses “--” versus one hundred to a few hundred milliseconds for the other: a ratio on the order of 200 (for 25,000 fields). This confirms what we already knew. Case closed? Well... not quite, because on small objects the “--” operator is the faster one! So let's run both implementations with vars.obj now holding a small object, but many times over:

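%dw 2.8
output application/json
// Current time in milliseconds since the epoch
fun Now() = now() then (t) -> (t as Number) * 1000 + t.milliseconds
// Runs the lambda prs, logs the elapsed time in ms, and returns its result
fun eval(prs) =
    Now() then (t1) -> prs() then (r) ->
    log(Now() - t1)
    then r
// Strategy 1: remove the keys of b from a, then append b
fun merge0(a, b) = (a -- keysOf(b)) ++ b
// Strategy 2: rebuild the object from filtered {k, v} entries
fun merge1(a, b) = (
    (a pluck {k: $$, v: $}
        filter isEmpty(b[$.k])
    ) ++ (b pluck {k: $$, v: $})
)
reduce (l, acc = {}) -> acc ++ {(l.k): l.v}
---
[
    // "0 to 10000" is an inclusive range, so each strategy runs 10,001 times
    eval(() ->
        (0 to 10000) map merge0(vars.obj, vars.obj)
    ),
    eval(() ->
        (0 to 10000) map merge1(vars.obj, vars.obj)
    )
]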

The results show that, roughly speaking, the “--” operator is the more efficient approach this time, by a factor of 1.5 to 2. The gap is much smaller than for large objects, but it can still matter when you have huge lists of objects to merge:

INFO  ... DefaultLoggingService$: 1018
INFO  ... DefaultLoggingService$: 1898
INFO  ... DefaultLoggingService$: 444
INFO  ... DefaultLoggingService$: 701
INFO  ... DefaultLoggingService$: 194
INFO  ... DefaultLoggingService$: 395
INFO  ... DefaultLoggingService$: 218
INFO  ... DefaultLoggingService$: 339
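Taken together, the two benchmarks suggest choosing the strategy according to the size of the objects. The following is only a sketch of that idea: the `merge` wrapper and the `THRESHOLD` value are hypothetical and were not measured in this post, so the cut-off should be tuned with your own benchmarks. merge1 is written with explicit lambdas but follows the same logic as above.

```dataweave
%dw 2.0
output application/json

// Hypothetical cut-off (not measured here): below it, "--" is assumed cheaper
var THRESHOLD = 100

// Strategy 1: remove the keys of b from a, then append b
fun merge0(a, b) = (a -- keysOf(b)) ++ b

// Strategy 2: rebuild the object from filtered {k, v} entries
fun merge1(a, b) =
    (
        (a pluck ((v, k) -> {k: k, v: v}) filter ((e) -> isEmpty(b[e.k])))
        ++ (b pluck ((v, k) -> {k: k, v: v}))
    ) reduce ((e, acc = {}) -> acc ++ {(e.k): e.v})

// Pick a strategy from the number of fields in the replacement object
fun merge(a, b) =
    if (sizeOf(b) < THRESHOLD) merge0(a, b) else merge1(a, b)
---
merge(
    { "firstname": "Harry", "postalcode": "1234" },
    { "postalcode": "4321", "phone": "06872354" }
)
```

Here the replacement object has two fields, so the call goes through merge0.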

Note that the DataWeave documentation describes a “mergeWith” function, which I understood to be available from DataWeave 2.8 onwards. For the time being, its existence seems limited to its documentation (?): I haven't managed to get Mule to accept it (even with version 4.8.0 of the server).
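One possible explanation, offered as an assumption rather than a diagnosis: as far as I can tell from the documentation, mergeWith belongs to the dw::core::Objects module rather than to the core functions, so it would have to be imported explicitly before use. A minimal sketch under that assumption:

```dataweave
%dw 2.0
// Assumption: mergeWith is not a core function and needs this explicit import
import mergeWith from dw::core::Objects
output application/json
var person = {
    "firstname": "Harry",
    "lastname": "Torkovsky",
    "address1": "3 Hotton Street",
    "address2": "Somewhere",
    "postalcode": "1234"
}
var repl = {
    "address1": "2 Avenue Hermonsa",
    "address2": "Elsewhere",
    "postalcode": "4321",
    "phone": "06872354"
}
---
// Fields of repl are expected to replace the fields of person that share their key
person mergeWith repl
```

If this works in your runtime, it would still be worth benchmarking it against merge0 and merge1 on large objects, since nothing here says how it is implemented.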

Wednesday, January 15, 2025
