We’re currently in maintenance mode on OSgrid. This means the grid is offline. Long technical story short: our storage cluster was seemingly running out of space. Our cluster runs Ceph, a distributed storage system. Its object storage daemons (OSDs) manage the data on local storage with redundancy and provide access to that data over the network. These daemons write your assets to disk as many tiny objects, and OSgrid has 17 years of data, which is approximately several hundred million assets.
Ceph, at its core, doesn’t have a “chunk size”. The RADOS layer stores each object as a whole, without any splitting or striping, and it reserves space on disk to do so. The space it allocates and the space the data actually needs are apparently not the same. What the backend team ran into is that the cluster somehow thinks it’s out of space, whilst there is something like 6 TB free. We noticed the increased disk usage a few weeks back already; it took a while to figure out why that was happening, as everything looked perfectly normal. It seems the asset data and its redundant copies are not being stored at their actual size, so to the OS every asset looks bigger than it actually is.
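To give a feel for how that adds up, here is a small back-of-envelope sketch. The allocation unit, average asset size and copy count below are assumptions for illustration only, not our actual Ceph configuration:

```
# Rough illustration of small-object overhead in an object store.
# All numbers are assumptions for illustration, not OSgrid's real settings.

def raw_usage(num_objects, avg_object_bytes, alloc_unit_bytes, copies):
    # Each copy of an object is rounded up to whole allocation units on disk.
    units_per_object = -(-avg_object_bytes // alloc_unit_bytes)  # ceiling division
    return num_objects * units_per_object * alloc_unit_bytes * copies

assets = 300_000_000        # "several hundred million assets"
avg_size = 8 * 1024         # assume an average asset of ~8 KiB
alloc = 64 * 1024           # assume a 64 KiB allocation unit
copies = 2                  # asset data plus one redundant copy

logical = assets * avg_size * copies
on_disk = raw_usage(assets, avg_size, alloc, copies)
print(f"data: {logical / 1e12:.1f} TB, allocated on disk: {on_disk / 1e12:.1f} TB")
```

With these made-up numbers, roughly 5 TB of actual asset data ends up claiming close to 40 TB of raw disk: the same shape of problem as space being “used” without holding that much data.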
To fix the above, the OSDs need to be reconfigured, which means moving millions of assets. The cluster can then perform a scrub to free up the space it thinks is occupied, so the data can be rewritten and that space becomes actual “free space” again. This is a pain to do while data is constantly being edited or added: it would take forever and give a bad user experience if we enabled the grid during these operations, not to mention the chances of corruption and failures. Hence the maintenance mode. Once we have sufficient progress we can open provisionally, while a sliding window of conversion (a few million assets at a time) keeps running in the background until this operation completes. Waiting for this initial scrub to finish is why we are not back in operation yet.
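For those curious what that sliding window looks like conceptually, here is a minimal sketch. The helper functions (list_asset_ids, read_asset, rewrite_asset) are hypothetical placeholders, not our actual tooling; the point is just that only the batch currently being converted is touched at any one time:

```
# Conceptual sketch of a sliding-window conversion.
# list_asset_ids, read_asset and rewrite_asset are hypothetical placeholders.

BATCH_SIZE = 2_000_000  # "a few million at a time"

def convert_all(list_asset_ids, read_asset, rewrite_asset):
    batch = []
    for asset_id in list_asset_ids():
        batch.append(asset_id)
        if len(batch) >= BATCH_SIZE:
            convert_batch(batch, read_asset, rewrite_asset)
            batch = []
    if batch:
        convert_batch(batch, read_asset, rewrite_asset)

def convert_batch(batch, read_asset, rewrite_asset):
    # Assets in this window may be briefly unavailable while they are rewritten.
    for asset_id in batch:
        data = read_asset(asset_id)      # load the asset
        rewrite_asset(asset_id, data)    # store it again in the new layout
```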
Quick Q&A
What are OSDs ? Object Storage Daemons. OSDs manage data on local storage devices, such as hard disk drives (HDDs) or solid-state drives (SSDs), and provide redundancy.
ETA on fix ? None as yet. We could put a nice story here, but it would be assumptions. ASAP, as far as we are concerned. We just don’t want to break or overlook things by being hasty, nor put things back online too soon and bog down the process. We could put it online, but experience has taught us that just because things are possible doesn’t necessarily mean they’re the smart thing to do.
Open provisionally ? This just means that certain assets may become unavailable for a short time when the conversion process runs across them. You will still be able to use the grid and build, and new assets will not be affected.
Why does it take so long ? It takes a while because it’s just a lot of data, and the machine does what it can. Think of the experience as staring at a 1995 defragmentation screen: take a packet, load it, rewrite it, store it, take the next one. And that for every asset we have. We’d love to speed up the process, but we can’t. The cluster is 100% dedicated to this task: it does not need to retrieve or consider new data, and does not need to sync new things. It takes the time it needs. The backend team does what it can; really, nobody likes downtime. These are operational issues you run into, and they just take a lot of time, because it’s a lot of data. Since most of what we do is pioneering, we sometimes find out about these things the hard way. That’s not an excuse, just an observation.
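To put “a lot of data” in perspective, here is some purely illustrative arithmetic. The per-asset time is an assumption, not a measurement from our cluster:

```
# Purely illustrative: why hundreds of millions of small assets take this long.
# The per-asset time is an assumption, not a measurement.

assets = 300_000_000           # order of magnitude of OSgrid's asset count
seconds_per_asset = 0.005      # assume ~5 ms to load, rewrite and store one asset

total_days = assets * seconds_per_asset / 86_400
print(f"roughly {total_days:.0f} days of sequential work")   # ~17 days
```

Even at a few milliseconds per asset, the sequential total runs into weeks, which is why there is no quick way to push it along.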
Did we lose assets / are we going to, due to this operation? Nothing should get lost! Technically, loading data, rewriting it and saving it again should not break things under normal operations. We were aware of the issue, and shut the grid down before it went haywire.
Does this have an impact on other OpenSim installs in the future ? Highly unlikely. Most grids will never grow to a size where they need such backend infrastructure, and those that do won’t have the technical limitations OSgrid inherited from ‘having a history’ and running solely on donations. Also, if best-practice configurations come out of this (which I don’t expect, as to the best of my knowledge it’s really specific to our database setup), they are for the benefit of all OpenSim users. Bugs in OpenSim we encounter get reported to the devs and patched before they get rolled out to you or your grid.
Why did you wait 3 days before telling us this ? We apologize for the initial lack of detailed communication; this topic should have been posted here earlier. If something happens to the grid, you will always find a notice on X first, after which we update Facebook and other channels. On this page we try to keep you informed with more background info as we have / get it. For the rest, you can just blame Foxx….
Anything else ? Just as a heads-up: more maintenance is planned for Q4/Q1, as the board decided in its last meeting that some of the plaza servers will be phased out to save costs. This should not involve considerable or noticeable downtime, as they are mostly used for backups, tests, isolation, and regions like the OSG-Birthday ones. Things might move around a bit as a result.