You hit the nail on the head, Rich. What you pointed out is exactly correct: there are many different combinations of hue and saturation that can lead to the same gray value (... or luminance value, or brightness value, etc.).
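Just to make that many-to-one point concrete, here's a minimal Python sketch (the RGB triples are made up for illustration, and Rec. 601 is just one common grayscale conversion): two very different colors land on the same 8-bit gray level, so the B&W pixel alone can't tell you which color it came from.

```python
# Illustration of the many-to-one mapping: two very different colors that
# collapse to (almost) the same 8-bit gray level under the Rec. 601 luma
# weights used by many grayscale conversions.
# (The specific RGB triples below are made-up examples.)

def luma(r, g, b):
    """Rec. 601 luma: weighted sum of R, G, B."""
    return 0.299 * r + 0.587 * g + 0.114 * b

saturated_green = (48, 180, 70)    # e.g., sunlit foliage
light_warm_tone = (158, 118, 100)  # e.g., a skin or sand tone

for name, rgb in [("green", saturated_green), ("skin/sand", light_warm_tone)]:
    print(f"{name:10s} RGB={rgb} -> gray {round(luma(*rgb))}")

# Both print "gray 128": nothing in the B&W image distinguishes them,
# so the algorithm has to rely on context (shape, texture, etc.) instead.
```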
The only way that this or any other algorithm has a chance of assigning colors correctly is if it is smart enough to classify various areas in an image based on factors such as their shape, size, texture, boundary shape & complexity, etc. If it can pull that off (which it apparently can, with some accuracy), then, for example, it won't often assign a dull green to the bottom of a storm cloud, or a light Caucasian skin tone to a moderately bright area in a sunset photo, even though both areas may have the same gray value.
If their many-layered (i.e., "deep learning") neural net algorithm is smart enough (and has enough exemplars to train on), then theoretically, it might eventually even be able to distinguish between similar areas (e.g., an evergreen vs. a deciduous forest) and color them appropriately.
After all, this is almost exactly how we manually colorize B&W photos: If we are observant and have a good artistic sense, we will likely have a good idea what colors to use in many areas of the image, and we take our best guess at the areas we know nothing about (e.g., clothing that could be almost any color while having the same luminance value).
So, my idea was to generate very difficult test cases for their algorithm that would indirectly tax / test its feature recognition abilities.
HTH,
Tom M
PS - @Rich54 - I just re-read my post and realized that my use of the term "best case scenario" may have been confusing. It was only meant to imply that if their algorithm couldn't do well with perfectly exposed, well focused, well lit, moderate contrast photos, then it would surely do much worse with average quality photos. It wasn't meant to imply that I thought the results would actually be good -- just that they would show off the best that the algorithm could possibly achieve.