From experience, I agree with your points. One thing OSM data is ok for is land classification labels (the "landuse" tag and similar), since accuracy matters less at those scales and the data takes less effort to clean up. Most of the work is aggregating disparate landuse values into buckets that make sense for your model.
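As a rough sketch of that bucketing step: the tag values below are real OSM "landuse" values, but the bucket names, the groupings and the `bucket_for` helper are just assumptions for illustration. You'd pick classes to suit your own model.

```python
# Hypothetical bucket scheme: the groupings and class names are assumptions.
LANDUSE_BUCKETS = {
    "residential": "urban",
    "commercial": "urban",
    "retail": "urban",
    "industrial": "urban",
    "farmland": "agriculture",
    "orchard": "agriculture",
    "vineyard": "agriculture",
    "meadow": "vegetation",
    "forest": "vegetation",
    "grass": "vegetation",
}


def bucket_for(tags, default="other"):
    """Map an OSM tag dict to a coarse class.

    Unknown or missing landuse values fall back to `default`.
    """
    return LANDUSE_BUCKETS.get(tags.get("landuse"), default)
```

Anything that lands in "other" is worth eyeballing, since that's usually where the long tail of odd tags ends up.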
To start, OSM doesn't use Google Maps imagery for annotation due to licensing concerns. As someone else mentioned, it's rarely clear whether researchers have the right to use Maps imagery, let alone download or re-publish it. Part of the reason is that Google sub-licenses imagery from several different providers, who are usually extremely protective of their IP. So immediately you'd have image/label alignment issues, since the labels were made from some source other than the Google imagery you'd be pairing them with.
Even if you had access to the image that someone used for labeling, it's non-trivial. They might not even have used an image! For example, you might walk around, take a GPS reading next to every object and use the keypoints as object centers. Sometimes the annotation quality is low, which hurts if you want to use building outlines or roads as segmentation targets for aerial imagery. Or things are simply misaligned. Also, since yurts are inherently mobile, you might not even be able to use those labels, because objects have moved and there's no guarantee they'll be present in Google Maps.
Finally you'd have issues of omission/commission, because you would have to assume that OSM is complete. That's very sensitive to how active the local community is. Some places are accurate down to the fire hydrant. Where I live, there are plenty of unmapped businesses that have been here for years. Though you could definitely use it to cross-check your own labels + predictions.
The standard for detecting objects on tiles is to discard border predictions and rely on overlapping (sliding window) prediction + non-max suppression (NMS) to handle duplicates. The overlap is usually something like 1x the receptive field of your model, and your "discard" region is a bit larger than your max expected object size.
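Here's a minimal pure-Python sketch of that pipeline, to make the moving parts concrete. Everything in it is an assumption rather than a reference implementation: the function names, the (x1, y1, x2, y2) box format in full-image coordinates, the rule of discarding by box centre, and the 0.5 IoU threshold. In practice you'd likely use your framework's own NMS.

```python
def tile_starts(length, tile, stride):
    """Start offsets along one axis so that tiles cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the image edge
    return starts


def in_core(box, x0, y0, tile, margin, width, height):
    """True if the box centre falls outside the tile's discard margin.

    The margin is not applied on sides that touch the image edge,
    because no neighbouring tile could pick those objects up.
    """
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    left = x0 + (margin if x0 > 0 else 0)
    top = y0 + (margin if y0 > 0 else 0)
    right = x0 + tile - (margin if x0 + tile < width else 0)
    bottom = y0 + tile - (margin if y0 + tile < height else 0)
    return left <= cx < right and top <= cy < bottom


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets, iou_thr=0.5):
    """Greedy NMS over (box, score) pairs, returned highest score first."""
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_thr for k in kept):
            kept.append((box, score))
    return kept
```

You'd run the model on every tile from `tile_starts`, shift the boxes into full-image coordinates, filter them with `in_core`, then run `nms` over everything that's left. To avoid gaps, keep the stride at or below the tile size minus twice the margin.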