There’s been a lot to do lately in the realm of business. Seems like it never stops. I have a project for a popular magazine. Not going to say what it handles or who it’s for, but let’s just say it hired a programmer once upon a time, and that programmer made a HUGE mess: nearly 18GB (in a single folder) is currently sitting on the server, with no easy way to clean it out.
I recently inherited this issue, and it’s a doozy! I can think of like 200 different ways to organize images while napping. Not claiming to be the smartest man on the planet, but let’s just go over how this is done.
First, information comes into the magazine: a bulk amount of data is imported, and each article has image assets. So let’s say article #1492 has 5 images attached to it to gather and store on the server. Perfectly fine. The import gives me those images with a filename that looks like this:
Now I have no control over naming conventions, so of course I too would find a way to keep track of images. Maybe a unique ID like, oh I don’t know, the article ID + filename, so something like:
This way, should I ever need to remove an article, I can script out just about ANYTHING to take that article number and purge the file system of it. No assets left over, life is happy, all is good! But sadly, in this case the data file comes back in each night with the same image names attached to something I didn’t want anymore, including some articles which expire by default, so assets can get mucky VERY quickly just by nature.
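Here is a minimal sketch of that article-ID-prefix idea in Python. Everything named here is my own illustration, not what the site actually runs: the `asset_name` and `purge_article` helpers, the `-` separator, and the example filename `photo.jpg` are all assumptions.

```python
from pathlib import Path


def asset_name(article_id: int, original_name: str) -> str:
    """Build a stored filename by prefixing the article ID (assumed '-' separator)."""
    return f"{article_id}-{original_name}"


def purge_article(upload_root: Path, article_id: int) -> list[Path]:
    """Delete every file under upload_root whose name starts with '<article_id>-'.

    Returns the paths that were removed so the caller can log them.
    """
    # Collect matches first so we are not deleting while the tree is being walked.
    matches = sorted(p for p in upload_root.rglob(f"{article_id}-*") if p.is_file())
    for path in matches:
        path.unlink()
    return matches
```

Because the glob pattern has to match the whole filename, purging article 1492 leaves something like `11492-photo.jpg` alone. Re-importing the same article would also just overwrite `1492-photo.jpg` instead of piling up copies.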
The oddity is that the guy who programmed this did something just plain odd by how my thinking works: as each asset got inserted into MySQL so the uploads could be tracked as media, he tacked the ID of the image onto the filename, not the ID of the article. Fine, right? Wrong. No checking.
So now, with an image ID generated by nothing more than its place in line, a single file has literally (and this isn’t exaggerating) about 125 entries in the file system. Oh, but wait, it creates print and thumbnail versions too, so 375 images (all the same file). And mind you, an article has say 5-10 images. I just laughed so hard I peed a little. Even on the short end, with only 5 images (and many have more), we’d be looking at 1,875 images. Duplicated. Unneeded.
Oh, there’s more. It’s spread out across an INSANE file structure. Let’s say the image ID was 4985 like I noted above: it would upload that file and its adjusted brothers and sisters to the following folder in the structure:
4985 = uploads/4/9/8/4985-f83r5713209sdf9834.jpg
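To make that layout concrete, here is a rough Python sketch of how such a path could be built. I’m assuming the first three digits of the image ID become the nested folders, which is just what the one example above suggests; the `sharded_path` name is made up, and I have no idea what the real code does with IDs shorter than three digits.

```python
from pathlib import PurePosixPath


def sharded_path(image_id: int, stored_name: str, root: str = "uploads") -> PurePosixPath:
    """Map an image ID to its nested upload folder.

    Assumption: each of the first three digits of the ID is one folder level.
    """
    digits = str(image_id)
    # Unpacking the string yields one path component per digit.
    return PurePosixPath(root, *digits[:3], stored_name)


print(sharded_path(4985, "4985-f83r5713209sdf9834.jpg"))
# uploads/4/9/8/4985-f83r5713209sdf9834.jpg
```

The catch is visible right in the function: the folder depends only on the image ID, so the same picture imported again under a new ID lands in a completely different corner of the tree.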
So these duplicates are spread out across a HUGE file structure. Thankfully I know I can delete a TON, because the current picture IDs, short of a few 190’s, are in the 200k’s. So just the 0’s and 1’s? 18.2GB. Wasted space. I did a check of the recent imports. So many duplicates, and the entire image folder of uploads/9/ is 47GB in storage. To top that off, one of the last imported files has references back to the 1’s folder! 03d2bedbe4c84229b5263785d61f432c is the filename (.jpg), but it has references back to when its image ID was 121! Oi!
The big chore now is writing a tool which will cross-reference everything in the file structure against the DB, really clear out the garbage, and get things running smoothly again. I have a meeting with the guy who tossed this project in my lap. While I love spending hours solving these stupid problems, I also like sleep, low stress, and not having the file system fill up with so much garbage that it runs out of storage and breaks the data import in the AM. I am tired of problems. I want happy solutions.
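A first pass at that cross-reference tool might look something like the Python sketch below. It is a dry run only: it reports files the database no longer knows about and deletes nothing. The `find_orphans` name is hypothetical, and I’m assuming the set of still-referenced paths (relative to the uploads folder) has already been pulled out of the DB by some query not shown here.

```python
import os
from pathlib import Path


def find_orphans(upload_root: Path, referenced: set[str]) -> list[Path]:
    """Return files under upload_root whose relative path is not in `referenced`.

    `referenced` is assumed to hold POSIX-style paths relative to upload_root,
    e.g. '4/9/8/4985-f83r5713209sdf9834.jpg', loaded from the database beforehand.
    """
    orphans = []
    for dirpath, _dirnames, filenames in os.walk(upload_root):
        for name in filenames:
            full = Path(dirpath, name)
            rel = full.relative_to(upload_root).as_posix()
            if rel not in referenced:
                orphans.append(full)
    return sorted(orphans)
```

Keeping the lookup as an in-memory set means each file costs one hash check instead of one query, which matters when the tree holds a couple million files. I’d review the orphan list by hand before letting anything near `unlink`.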
So busy. I need sleep.
Examining 358356 files, 111 dirs (in 1 specified)
209598 duplicate files (in 88853 sets), occupying 10715 MB (aka 10.715GB)
Note this is just the number 1 folder. That’s not including 0 and 2-9 (as well as some test folders for what may have been meant to be article IDs).
Examining 2317950 files, 3157 dirs (in 1 specified)
2104868 duplicate files (in 182205 sets), occupying 119834 MB (aka 119.834GB)
All folders scanned. Horrific waste of space. This guy was being paid an arm and a leg to get this code up to snuff, and all he would have had to do was make the prefix the article ID and he’d have had an insanely EASY way to prevent duplicates. Even if he overwrote the file each time: no duplicates, because the name would always start with the article ID. Oi. Now the fun task: can I rewrite themes and such to match the article ID prefix? Or do I go with my “Let me rebuild this on top of WordPress” idea? Hrmm. I’ll sleep on it.
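For anyone wanting to run the same kind of duplicate check without a dedicated scanner, here is a rough Python sketch. It is not the tool that produced the numbers above, just the same general idea: group files by size, hash only the ones that share a size, and call identical hashes duplicates. The function names are my own.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so big images do not eat all the RAM."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_sets(root: Path) -> list[list[Path]]:
    """Return groups of files under root that have byte-identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath, name)
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing it
        for path in paths:
            by_hash[(size, file_digest(path))].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]


def wasted_bytes(sets: list[list[Path]]) -> int:
    """Space taken by the extra copies, keeping one file per set."""
    return sum(paths[0].stat().st_size * (len(paths) - 1) for paths in sets)
```

The size check up front is the cheap trick here: most files never get read at all, and only real candidates pay for a full hash.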