Apologies if crossposting is against the rules; I’m not entirely sure where the lines are drawn here yet.

I posted this in lemmy.fosscad (fosscad@lemmy?), but realize that may not be the most active venue.

I grabbed archives of fosscad and took a look at the contents of the .zst files. I think I could probably rebuild the contents of the subreddit in some manner or another; the question is scale and hosting. How would we make the posts easily searchable, where would they live, what endpoint can we upload hundreds of thousands of comments to in a reasonable time frame… all that fun stuff.

The archives don’t contain pictures, but contain links to the pictures and the ones I’ve checked are currently still live (meaning the pics are still hosted on reddit). Dunno how long that will remain the case.

I have no idea what the size of the archives would be with pics downloaded; gigs, a TB, no clue. I’m posting this to gauge public interest and I haven’t done much preliminary work (oh, these are json. Yep, dictionaries work. Wingo.)
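For anyone who wants to poke at the data, here is a rough sketch of the "dictionaries work" part. It assumes the archive has already been decompressed and that each line holds one JSON object, which is how the dumps I looked at appear to be laid out. The function name `iter_records` and the sample records are made up for illustration.

```python
import io
import json


def iter_records(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Skip a damaged line rather than abort a long run.
            print(f"skipping unparseable line {line_number}")


# Two made-up records standing in for a decompressed dump (assumption:
# real records carry many more fields than these).
sample = io.StringIO(
    '{"id": "abc123", "author": "someone", "title": "Example post"}\n'
    "\n"
    '{"id": "def456", "author": "someone_else", "title": "Another post"}\n'
)
records = list(iter_records(sample))
```

Any open text file can be passed in place of `sample`, since the function only needs something that yields lines.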

Is there any interest in making this more publicly available? I’ve run into an issue with a particular build and I’ll be diving through the archives to fix it for myself. It seems like a shame that all this information would be inaccessible to everyone who isn’t able to, or interested in, trawling through their own local archives.

I’m not a programmer by trade, but work in an adjacent space. I can plink along on this if other people are interested (and if anyone is interested enough to help pitch in, even better).

  • hoglegOP · 4 days ago · 2 upvotes

    Awesome! https://github.com/hoglegcc/rebuildtheark

    I was able to add the submissions zst, but I’m running into a file size limit when adding the comments archive. Compressed, the comments are around 35MB (480MB uncompressed) and GitHub has a limit of 25MB. Worst case scenario I can get it to you another way.

    I’m technically supposed to be doing something else right now, but I should be able to get the comments issue squared away in the next couple of hours. Also, dunno if you want a private repo for the actual bot; I don’t know if you have keys you’re worried about or anything like that. I can set up a private repo, or if you want to make one and add me, either way works. Or if a public repo works for you, that works for me, too.

    Thanks!

    • Grey Summit Gear@fosscad.io · 4 days ago · 2 upvotes

      Thanks. I gotta maybe make a new GitHub account if I don’t wanna doxx myself.

      But I will clone it in the meantime.

      • hoglegOP · 4 days ago · 2 upvotes

        I was able to get the comments uploaded after splitting them in 7zip. After recombining the parts with 7zip, the hashes still match and the first couple of items looked good to go.
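        If anyone wants to repeat the split-and-verify step without 7zip, a rough stand-in looks like this. It is a plain byte split in Python, not what 7zip does internally, and the 24MB part size, the `.001`-style suffixes, and the function names are my own assumptions.

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 24 * 1024 * 1024  # assumed part size, just under the 25MB limit


def sha256_of(path):
    """Hash a file in 1MB blocks so large archives are not read in one go."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def split_file(path, chunk_size=CHUNK_SIZE):
    """Write path.001, path.002, ... and return the part paths in order."""
    part_paths = []
    with open(path, "rb") as source:
        index = 1
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            part_path = f"{path}.{index:03d}"
            with open(part_path, "wb") as part:
                part.write(chunk)
            part_paths.append(part_path)
            index += 1
    return part_paths


def join_files(part_paths, out_path):
    """Concatenate the parts, in the order given, into out_path."""
    with open(out_path, "wb") as out:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                out.write(part.read())


# Demo on a small throwaway file instead of the real comments archive.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "comments.zst")
    with open(original, "wb") as handle:
        handle.write(os.urandom(10_000))
    parts = split_file(original, chunk_size=4_096)
    rebuilt = os.path.join(workdir, "rebuilt.zst")
    join_files(parts, rebuilt)
    part_count = len(parts)
    hashes_match = sha256_of(original) == sha256_of(rebuilt)
```

        Comparing the hash of the original against the rebuilt file is the same check I did by hand after recombining.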

        I also threw a Python module and an example loop in there, and put an example comment item and submission item in the readme.

        I was thinking of making one class for threads and another for comments. Each submission would be an instance of the thread class, and its comments would be instances held in a list on their parent thread, so we can track comment author and UTC timestamp for chronology.
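        Roughly what I have in mind, as a sketch only. It assumes the usual Reddit-style fields (`id`, `author`, `created_utc`, `title`, `body`, and a comment `link_id` of the form `t3_<submission id>`); the class names, `build_threads`, and the sample data are placeholders, not what is in the repo.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    id: str
    author: str
    created_utc: int
    body: str


@dataclass
class Thread:
    id: str
    author: str
    created_utc: int
    title: str
    comments: List[Comment] = field(default_factory=list)

    def chronological(self):
        """Return this thread's comments, oldest first."""
        return sorted(self.comments, key=lambda c: c.created_utc)


def build_threads(submissions, comments):
    """Attach each comment dict to its parent thread via link_id."""
    threads = {}
    for sub in submissions:
        threads[sub["id"]] = Thread(
            id=sub["id"],
            author=sub.get("author", "[unknown]"),
            created_utc=int(sub["created_utc"]),
            title=sub.get("title", ""),
        )
    orphans = []  # comments whose submission is not in this archive
    for raw in comments:
        link_id = raw.get("link_id", "")
        thread_id = link_id[3:] if link_id.startswith("t3_") else link_id
        comment = Comment(
            id=raw["id"],
            author=raw.get("author", "[unknown]"),
            created_utc=int(raw["created_utc"]),
            body=raw.get("body", ""),
        )
        if thread_id in threads:
            threads[thread_id].comments.append(comment)
        else:
            orphans.append(comment)
    return threads, orphans


# Made-up sample data for illustration.
subs = [{"id": "abc123", "author": "poster", "created_utc": 100, "title": "Example"}]
coms = [
    {"id": "c2", "author": "b", "created_utc": 300, "body": "second", "link_id": "t3_abc123"},
    {"id": "c1", "author": "a", "created_utc": 200, "body": "first", "link_id": "t3_abc123"},
    {"id": "c3", "author": "c", "created_utc": 250, "body": "lost", "link_id": "t3_zzz999"},
]
threads, orphans = build_threads(subs, coms)
ordered_ids = [c.id for c in threads["abc123"].chronological()]
```

        Keeping orphaned comments in their own list means nothing gets silently dropped if the submissions and comments archives cover different date ranges.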

        I can build out classes and assigning comments to threads. I imagine you’re going to be much faster at all of this than I am, so whenever you’ve got the image scrape ready, let me know and I’ll start grabbing my share of the links.

        • The ShittinatorMA · 4 days ago · 2 upvotes

          Is there more to this archive? Posts seem to cut off right at Sat Dec 31 2022 23:42:15 GMT+0000, which is suspiciously close to the end of a year.

          • hoglegOP · 3 days ago · 3 upvotes

            There should be more; they may be split into several differential files. I’m going to work on getting the others right now, but I’ve got limited time before I have to leave for work. If I can’t get them up in time, I’ve been using resources like https://academictorrents.com/details/ba051999301b109eab37d16f027b3f49ade2de13/tech&filelist=1 (if URLs can’t be posted, it’s academictorrents dot com, posts by Watchful1). I could have sworn I grabbed newer data for the fosscad subreddit specifically, but I may have to go through the monthly diffs in this link and pull out anything under the /fosscad id.
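            Pulling the fosscad items out of a monthly dump could look something like this. It is a sketch that assumes each record is already a dict with a `subreddit` field holding the subreddit name; the function name `only_subreddit` and the sample records are invented for illustration.

```python
def only_subreddit(records, name="fosscad"):
    """Yield only the records whose subreddit field matches name."""
    wanted = name.lower()
    for record in records:
        # Compare case-insensitively, since capitalisation may vary.
        if str(record.get("subreddit", "")).lower() == wanted:
            yield record


# Made-up records standing in for one monthly dump.
monthly = [
    {"id": "a1", "subreddit": "fosscad"},
    {"id": "b2", "subreddit": "somethingelse"},
    {"id": "c3", "subreddit": "Fosscad"},
    {"id": "d4"},  # record with no subreddit field at all
]
kept_ids = [record["id"] for record in only_subreddit(monthly)]
```

            Since it is a generator, it can sit on top of a line-by-line reader and never needs a whole monthly file in memory.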