Tuesday, April 15, 2008

The case of the unloved files

I've been investigating a strange problem on one of my servers. It started out as a simple case of NetBackup failing to back the filesystem up.

Now that's not entirely unusual - NetBackup often goes off and sulks in a corner. But this was rather different, as it didn't disappear as mysteriously as it came. Rather, it stayed put and the filesystem repeatedly refused to complete a backup. And the diagnostics were pretty much non-existent.

OK, so after a week or so of this I decide to try an alternative approach. To my surprise, I couldn't blame NetBackup.

First attempt was to try ufsdump. It started off in a promising manner, then froze completely on me.

OK, that's not good. So I try various tar commands instead, writing to a remote tape or filesystem. That would work, right? Wrong!

Each attempt freezes completely on me. That's local tar piped to tar that writes over nfs; tar piped to an rsh; tar on an nfs client; tar using rmt to write to a tape.
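A minimal sketch of one of those pipelines - a local tar piped to an extracting tar - looks like this. In the real case the destination was an NFS mount or a remote host reached over rsh; here local temp directories stand in so the shape of the pipeline can be shown end to end (every path is made up):

```shell
#!/bin/sh
# Sketch of the tar-pipe copy. With $dst on an NFS mount (or the extracting
# tar running on the far side of an rsh), this is one of the pipelines that
# wedged part-way through the suspect files.
set -e

src=$(mktemp -d)   # stand-in for the filesystem being copied
dst=$(mktemp -d)   # stand-in for the NFS-mounted destination

# A stand-in for the data being backed up.
mkdir -p "$src/indexes/view1"
dd if=/dev/urandom of="$src/indexes/view1/data.idx" bs=1024 count=64 2>/dev/null

# Local tar writes the archive to stdout; the second tar extracts from stdin.
(cd "$src" && tar cf - .) | (cd "$dst" && tar xf -)

# Verify the copy arrived intact.
cmp "$src/indexes/view1/data.idx" "$dst/indexes/view1/data.idx" && echo "copy OK"

rm -rf "$src" "$dst"
```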

That's odd. Now, at least I can look at the output and see how far it's got. I'm starting to make headway, as it looks like it gets to the same point each time.

OK, so I start to build a copy by tarring up various bits of the filesystem, avoiding the place where I know it breaks. Until I get into the suspect area. And yes, it still fails in the same place (but at least I've got a copy of most of the filesystem now, so can breathe easier).

The bad files live in an area that holds various versions (views of the data) of the index files of a proprietary search engine. Now, it looks like tar always traverses the hierarchy in the same order. OK, so if I manually list the subdirectories in an order that puts the failing one last, I can copy off the files I'm missing, right?
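The reordering idea relies on tar archiving its command-line arguments in the order given, so listing the suspect directory last gets everything else safely across before the copy can wedge. A rough sketch, with made-up view names and local temp directories standing in for the remote end:

```shell
#!/bin/sh
# Sketch of tarring subdirectories in an explicit order, suspect view last.
# Directory names are placeholders; in the real case the extracting tar ran
# on the far side of an rsh or an NFS mount.
set -e

src=$(mktemp -d)
dst=$(mktemp -d)
mkdir "$src/view1" "$src/view2" "$src/suspect_view"

# tar archives members in the order they appear on the command line, so
# view1 and view2 are fully written out before suspect_view is touched.
(cd "$src" && tar cf - view1 view2 suspect_view) | (cd "$dst" && tar xf -)

ls "$dst"
rm -rf "$src" "$dst"
```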

That was a fine theory until it froze on me again. And this is where it gets really strange. Each subdirectory has the same structure, so in each subdirectory there's a bunch of files with different suffixes. And it always fails on one particular suffix. Furthermore, it fails at about the same distance in (about 38 megabytes into a 40-megabyte file). That's about as weird as it gets, in my experience. What on earth is it about these files that causes anything that tries to back them up to lock up completely?
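One way to bracket the offset at which a transfer wedges - not something the post describes doing, just a hypothetical sketch - is to copy successively longer prefixes of the suspect file with dd. Against the real destination the loop would hang on the first iteration that includes the troublesome bytes; here local temp files stand in for the index file and the NFS mount:

```shell
#!/bin/sh
# Hypothetical sketch: bracket the offset at which a copy wedges by sending
# ever-larger prefixes of the suspect file. Local temp files stand in for
# the ~40MB index file and the NFS-mounted destination that actually wedged.
set -e

file=$(mktemp)     # stand-in for the suspect index file
dest=$(mktemp -d)  # stand-in for the NFS-mounted destination
dd if=/dev/zero of="$file" bs=1048576 count=40 2>/dev/null

# Copy 1MB-granularity prefixes of increasing length; against the real
# destination the loop would hang at the iteration containing the bad
# bytes (around the 38MB mark described above).
for mb in 10 20 30 35 38 40; do
    dd if="$file" bs=1048576 count="$mb" 2>/dev/null > "$dest/prefix.$mb"
    echo "first ${mb}MB copied"
done

rm -rf "$file" "$dest"
```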

And it gets worse. I can cp this file locally. Try cp to any nfs-mounted location and the cp wedges. An rcp to any remote system wedges. And it's the same again - it wedges at the same distance into the file.

It must be something in the data stream that's contained in these files. At least I did find one way to copy them across - if I gzip them first then they go across fine.
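The gzip workaround might look like this (all paths are stand-ins): because the compressed bytes bear no resemblance to the originals, whatever byte sequence upsets the transfer never appears on the wire, and gunzip restores an identical file at the far end.

```shell
#!/bin/sh
# The gzip workaround: compress before the copy so the original byte
# sequence never crosses the network, then decompress on arrival. Local
# temp directories stand in for the source and the NFS-mounted destination.
set -e

src=$(mktemp -d)
dst=$(mktemp -d)
printf 'index data that would not copy raw\n' > "$src/view.idx"

gzip -c "$src/view.idx" > "$dst/view.idx.gz"   # the bytes "on the wire" differ
gunzip "$dst/view.idx.gz"                      # recreates view.idx at the far end

cmp "$src/view.idx" "$dst/view.idx" && echo "files identical"
rm -rf "$src" "$dst"
```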

But where is the bug that's being tickled? Is this something in the network stack?

3 comments:

Chad Mynhier said...

Hmm, I had a user come to me with something very similar a couple of months ago. I'll try to dig up the details and see if I can reproduce the problem.

Chad Mynhier said...

Okay, I tracked down the details, and even though it was every bit as weird, it wasn't quite the same.

In my case, a user had a script which made a database connection, disconnected, and then used rsync (over rsh) to pull a log file over from the database server. The log file would transfer successfully, but rsync would report a prematurely-terminated connection. Without the opening and closing of the database connection, rsync reported no error.

(The user found a work-around, and I didn't have time to probe deeper, so I don't know what was going on.)

Peter Tribble said...

Having talked to some people the other night, this does look like a network issue - something is interpreting the sequence of bytes and getting confused.

Having checked the systems I have available to me, this problem only affects one system. If that system is involved at all, things go bad. Unfortunately that system is the primary fileserver...