Sunday, October 12, 2008

Rescue Your JFS Partition With jfsrec

So, my 500GB Deskstar (aka DeathStar) on my file server developed bad sectors. The original rescue plan was simple: use dd_rescue to clone the disk image. Normally, that gives a pretty good result for a drive that has only recently started going bad.

The problem was, I forgot that my IDE2USB adapter does not support hard drives that big, and the adapter simply bailed out when it hit an unrecoverable sector. So I did it the other way: mounted the hard drive read-only and ran rsync directory by directory.
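
A minimal sketch of the directory-by-directory copy, assuming the old drive is already mounted read-only. The device name, the mount point and the helper name rescue_dirs are placeholders for illustration, not the commands actually used; the point is only that one unreadable directory should not abort the whole run.

```shell
# Sketch only: /dev/sdX1 and /mnt/old are placeholders. Mount as root first:
#   mount -t jfs -o ro /dev/sdX1 /mnt/old

# rescue_dirs SRC DST [LOG]: rsync each top-level entry of SRC on its own, so
# an entry that hits unreadable sectors is logged and skipped instead of
# stopping the whole copy. Top-level dotfiles are not matched by the glob.
rescue_dirs() {
    src=$1
    dst=$2
    log=${3:-rescue-failed.log}
    mkdir -p "$dst" || return 1
    : > "$log"
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue        # SRC was empty
        if ! rsync -a "$entry" "$dst/"; then
            echo "$entry" >> "$log"        # come back to this one later
        fi
    done
}
```

Something like rescue_dirs /mnt/old /mnt/new/rescued would then leave a list of the entries that need a second look in rescue-failed.log.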

To speed up the file transfer and skip files that are publicly available, I remounted the drive read-write, without any backup copy (that, I tell you, was a big mistake that I would probably not make in the workplace, but I get a bit lax with my private stuff). After an rm some_distro.iso, bang... the kernel panicked!

I rebooted the machine and remounted the hard drive. Now it was even worse: the mounted JFS volume was empty. Zero files!

Since all my personal photos, music, documents, source code, whatever-you-can-think-of, were stored on that drive, losing all the data would be such a pain. I realized I had to proceed with caution from that point forward. So, here's what I did:

Step 1: Buy another drive (1TB baby!) and attach both drives to the machine (keep the RAID card from getting too clever and initializing both drives!); boot the box with Knoppix, and clone the old drive onto the new one (be careful of the drive order!)
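
Getting the drive order wrong would overwrite the only copy of the data, so a small guard before cloning is cheap insurance. The sketch below is an illustration, not part of the original procedure: the helper names size_of and safe_to_clone and the /dev/sdX, /dev/sdY names are hypothetical. It only catches a swap when the two drives differ in size, which happens to be the case with a 500GB source and a 1TB target.

```shell
# size_of PATH: size in bytes of a block device or of a regular file
# (regular files are handled so the check can be tried on disk images).
size_of() {
    if [ -b "$1" ]; then
        blockdev --getsize64 "$1"
    else
        stat -c %s "$1"
    fi
}

# safe_to_clone SRC DST: succeed only if DST is at least as large as SRC.
# Cloning the 1TB drive onto the 500GB one by mistake fails this check.
safe_to_clone() {
    src_size=$(size_of "$1") || return 2
    dst_size=$(size_of "$2") || return 2
    [ "$dst_size" -ge "$src_size" ]
}

# Hypothetical usage; /dev/sdX is the old drive, /dev/sdY the new one:
#   lsblk -o NAME,SIZE,MODEL,SERIAL     # identify drives by model and serial
#   safe_to_clone /dev/sdX /dev/sdY && dd_rescue /dev/sdX /dev/sdY
```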

Step 2: Make another copy of the cloned image (remember that the original drive has already developed bad sectors, so I may never be able to make a disk image as good as this one again).
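
A second copy is only useful if it is known to be identical, so it is worth comparing checksums right after copying. This is a minimal sketch assuming GNU cp and sha256sum are available; copy_and_verify is a hypothetical helper name and the image paths are placeholders.

```shell
# copy_and_verify SRC DST: copy an image and confirm both checksums match.
# --sparse=always (GNU cp) keeps long runs of zeros from using disk space.
copy_and_verify() {
    cp --sparse=always "$1" "$2" || return 1
    src_sum=$(sha256sum < "$1") || return 1
    dst_sum=$(sha256sum < "$2") || return 1
    if [ "$src_sum" = "$dst_sum" ]; then
        echo "OK: $2 matches $1"
    else
        echo "MISMATCH: $2 differs from $1" >&2
        return 1
    fi
}

# Hypothetical usage:
#   copy_and_verify /path/to/my/disk.image /path/to/my/disk.image.work
```

All further experiments can then run against the working copy, with the first image kept untouched.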

Step 3: Run fsck.jfs on the drive. It stopped with the following error message:
Duplicate block references have been detected in
Metadata. CANNOT CONTINUE.
It means fsck.jfs is not going to do it this time, and the corruption goes all the way down to the metadata... Oh snap!

Step 4: Google for the JFS on-disk structure or any other recovery tool (besides TSK/Sleuthkit... it is too much hassle), and find a great tool called jfsrec.

Step 5: Follow jfsrec's documentation, and run something like this:
./jfsrec --device /path/to/my/disk.image --output /path/to/recovery/directory --logdir /path/to/log.dir
According to the doc, the process could take days, but mine went pretty well: it took just 24 hours.

Step 6: Now I have most files recovered in /path/to/recovery/directory. Even though some files are corrupted and some filenames are lost (those files are named after their inode numbers), it is already much better than having it all gone. Still, there's another problem: jfsrec does not handle UTF-8 filenames properly yet (Announcing jfsrec - A JFS recovery tool: msg#00008), so I ended up with a number of files with garbled names.

Step 7: Fortunately, there is another useful tool to fix this issue. That is... "drum roll please"... convmv. As I said in the last step, jfsrec does not handle UTF-8 filenames properly, so the recovered filenames are actually UTF-8 interpreted as ISO8859-1. The idea is to reverse the process. Without --notest, convmv only prints what it would rename, so I checked first with:
cd path/to/garbled/filenames
convmv -f utf-8 -t iso8859-1 *
The result looked pretty good, so I ran the command again with --notest to actually do the renaming:
convmv -f utf-8 -t iso8859-1 * --notest
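
To see why converting "from UTF-8 to ISO8859-1" repairs names that are supposed to end up as UTF-8, it helps to replay the damage on a single name with iconv. This is only a sketch of one reading of the symptom (UTF-8 bytes treated as ISO8859-1 characters and then encoded as UTF-8 again); the function names are made up for the demo and the sample name is not from the recovered data.

```shell
# garble: the assumed damage, each original byte re-encoded as if it were
# an ISO-8859-1 character.
garble() { iconv -f ISO-8859-1 -t UTF-8; }

# repair: the reverse mapping, the same direction as the convmv call above.
repair() { iconv -f UTF-8 -t ISO-8859-1; }

# hex: print stdin as a plain hex string for easy comparison.
hex() { od -An -tx1 | tr -d ' \n'; }

# "café.txt", written with octal escapes so the demo is locale-independent.
printf 'caf\303\251.txt' | hex; echo                     # 636166c3a92e747874
printf 'caf\303\251.txt' | garble | hex; echo            # 636166c383c2a92e747874
printf 'caf\303\251.txt' | garble | repair | hex; echo   # 636166c3a92e747874
```

The two bytes c3 a9 of the accented letter turn into four bytes when garbled, and the repair step brings back exactly the original two.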
Followup...
  1. Check the file integrity, and see which files are corrupted.
  2. Build a RAID-1 system, even if it is only soft-RAID (yes, mine is a software RAID controller card, due to a tight budget).
  3. Back up... frequently, and preferably keep an offsite backup too (an easy way is to store the backup in my office). Make sure it is encrypted, because I would rather not give others the chance to see my bank statements, etc.
  4. and a few other steps...
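
For the integrity check in item 1, a first pass can be automated for formats that carry their own checksum. The sketch below is an assumption about how one might start, not what was actually run: it lists empty files and gzip archives that fail gzip -t. The helper names are hypothetical, other formats need their own checker, and filenames containing newlines are not handled.

```shell
# list_empty DIR: recovered files with no content at all.
list_empty() {
    find "$1" -type f -size 0
}

# list_bad_gzip DIR: *.gz files that fail gzip's built-in integrity test.
list_bad_gzip() {
    find "$1" -type f -name '*.gz' | while IFS= read -r f; do
        gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
    done
}

# Hypothetical usage:
#   list_empty    /path/to/recovery/directory > empty-files.txt
#   list_bad_gzip /path/to/recovery/directory > bad-gzip.txt
```
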
Improvements & Things Learned
  1. Did I say "backup often"?
  2. Invest in a redundancy solution (weigh the cost of data loss against the price of a redundant configuration).
  3. Follow the steps carefully, even if it is my home machine.
  4. Avoid making important decisions when I haven't had enough sleep (I had a heartbreak and had been sleepless for days when I managed to corrupt the JFS volume).
  5. Do not put too much trust in these cheap IDE2USB cables. Also, document and remember their limitations (e.g. maximum size supported).
  6. Offer improvements to the jfsrec project on UTF-8 support (when my mood gets better, and I make up my mind).

Wednesday, October 1, 2008

Managing Windows Service / Reboot On Linux / *nix

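Anyone with some experience administering Linux will be no stranger to remote system administration on Linux or Unix. Even with no other services enabled, ssh alone is enough to automate a lot of system work.

As for managing Windows from Linux, many people know about using Samba for file exchange, but the net tool that ships with Samba is rarely mentioned as a way to manage services on Windows.

To manage Windows with Samba, the following packages need to be installed: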
  • samba-common
  • samba-client

For example, to reboot a Windows server, run this on Linux:
net rpc shutdown -r -U <myusername> -I <WinServerIPAddr>
To shut the machine down without rebooting, drop the -r option.

Likewise, to start a system service on Windows, use:
net rpc service start tomcat6 -U <myusername> -I <WinServerIPAddr>
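
The same family of subcommands also covers stopping a service, so a restart can be scripted as a stop followed by a start. This is a rough sketch: the function name restart_win_service and the NET_CMD variable are hypothetical additions for illustration (NET_CMD exists only so the sketch can be tried with echo instead of a real server), and the user and address are placeholders.

```shell
# NET_CMD is not part of Samba; NET_CMD=echo just prints the arguments.
NET_CMD=${NET_CMD:-net}

# restart_win_service SERVICE USER HOST: stop, then start, a Windows service.
# Gives up without starting if the stop step reports an error.
restart_win_service() {
    "$NET_CMD" rpc service stop  "$1" -U "$2" -I "$3" || return 1
    "$NET_CMD" rpc service start "$1" -U "$2" -I "$3"
}

# Example with placeholders (192.0.2.10 is a documentation-only address):
#   restart_win_service tomcat6 myusername 192.0.2.10
# Other subcommands in the same family: list, status, pause, resume.
```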

There are plenty of other tricks; read the documentation for those, I won't repeat them here.