172.7k yurts. Assuming these are family residences for the most part, an average occupancy of 4 (which is probably too low - the fertility rate is still quite high there) gives ~691k people living in yurts, approximately 20% of the population of 3.5 million. Sounds reasonable.
From memory: 3 million people, 1.5 million living in the capital.
Let's say 1 million are living outside cities.
4 people per yurt.
250,000 yurts.
Add some extra yurts, because some people will have more than one, some will live in a house with a yurt in the garden, some yurts will be used as warehouses, etc.
300,000, which is almost double the count from the ML app.
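The same back-of-envelope estimate as a quick Python sketch. The inputs are only the rough guesses above; the 50,000 allowance is simply the gap between 250,000 and 300,000, and 172,700 is the ML count quoted elsewhere in this thread.

```python
# Fermi estimate of yurts in Mongolia, using the rough guesses from the comment.
rural_population = 1_000_000  # guess: people living outside cities
people_per_yurt = 4           # assumed average occupancy
extra_yurts = 50_000          # allowance: second yurts, garden yurts, storage

household_yurts = rural_population // people_per_yurt
estimate = household_yurts + extra_yurts

ml_count = 172_700            # count reported by the ML app
ratio = estimate / ml_count   # how far apart the two numbers are

print(household_yurts, estimate, round(ratio, 2))
```

The result is quite sensitive to the occupancy guess: with 5 people per yurt the household figure drops to 200,000.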
To start, OSM doesn't use Google Maps imagery for annotation due to licensing concerns. As someone else mentioned, it's rarely clear whether researchers have the right to use Maps imagery let alone download/re-publish it. Part of the reason is that Google sub-licenses imagery from several different providers who are usually extremely protective of IP. So immediately you'd have image/label alignment issues.
Even if you had access to the image that someone used for labeling, it's non-trivial. They might not even have used an image! For example you might walk around and take a GPS reading next to every object and use the keypoints as object centers. Sometimes the annotation quality is low, for example if you want to try using building outlines or roads as segmentation targets for aerial imagery. Or things are simply misaligned. Also since yurts are inherently mobile, you might not even be able to use those labels because objects have moved and there's no guarantee they'll be present in Google Maps.
Finally you'd have issues of omission/commission, because you would have to assume that OSM is complete. That's very sensitive to how active the local community is. Some places are accurate down to the fire hydrant. Where I live, there are plenty of unmapped businesses that have been here for years. Though you could definitely use it to cross-check your own labels + predictions.
The standard approach to detecting objects on tiles is to discard border predictions and rely on overlapping (sliding window) prediction + non-maximum suppression (NMS) to handle duplicates. The overlap is usually something like 1x the receptive field of your model, and your "discard" region is a bit larger than your max expected object size.
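A minimal sketch of that recipe in plain Python, not any particular library's API. The function names are hypothetical, and the numbers (512 px tiles, 384 px stride, 48 px discard margin, IoU threshold 0.5) are illustrative assumptions. Boxes are (x1, y1, x2, y2, score) in whole-image coordinates.

```python
def tile_origins(length, tile, stride):
    """Start offsets of overlapping windows covering [0, length)."""
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last window sits flush with the image edge
    return origins

def keep_interior(box, tile_xy, tile, margin, image_wh):
    """Keep a box unless its centre lies in the discard band along a tile edge.
    Tile edges that coincide with the image border have no discard band.
    The overlap (tile - stride) should be at least 2 * margin, so every
    object is in the keep region of at least one tile."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    (tx, ty), (w, h) = tile_xy, image_wh
    left = tx + margin if tx > 0 else tx
    top = ty + margin if ty > 0 else ty
    right = tx + tile - margin if tx + tile < w else tx + tile
    bottom = ty + tile - margin if ty + tile < h else ty + tile
    return left <= cx <= right and top <= cy <= bottom

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept

# Toy example: an 896x512 image split into two tiles at x=0 and x=384.
image_wh, tile, stride, margin = (896, 512), 512, 384, 48
detections = [
    ((0, 0), (420, 100, 450, 130, 0.90)),    # yurt in the overlap, seen by tile 0
    ((384, 0), (421, 101, 451, 131, 0.80)),  # same yurt, seen by tile 1
    ((0, 0), (500, 200, 512, 230, 0.60)),    # yurt cut off at tile 0's right edge
    ((384, 0), (500, 200, 530, 230, 0.85)),  # same yurt, fully visible in tile 1
]
kept = [b for t, b in detections if keep_interior(b, t, tile, margin, image_wh)]
final = nms(kept)
print(len(final), "yurts after border discard + NMS")
```

The clipped detection at the tile edge is dropped by the border rule, and the duplicate in the overlap is dropped by NMS, so each yurt is counted once.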
172k of them? That still seems like quite a lot of yurts; certainly more yurts per capita than anyone else has.
Living away from other people and not next to anything in particular is what I associate with nomads, so the heuristic of searching a radius around landmarks doesn't make sense to me. I scrolled around a random remote desert area in Mongolia on Google Maps and found a yurt every couple of minutes.
https://taginfo.geofabrik.de/asia:mongolia/tags/building=ger
I'm also guessing your model doesn't handle yurts that are on the border of a tile.
Finally, that's a much smaller number than I expected for a country of 3 million.